# podresources: add Watch endpoint #1926
@@ -60,12 +60,18 @@

## Changes

Add a v1alpha1 Kubelet GRPC service, at `/var/lib/kubelet/pod-resources/kubelet.sock`, which returns information about the kubelet's assignment of devices to containers. It obtains this information from the internal state of the kubelet's Device Manager.

The GRPC service exposes two endpoints:
- `List`, which returns a single `ListPodResourcesResponse`, enabling monitoring applications to poll for the resources allocated to pods and containers on the node.
- `Watch`, which returns a stream of `WatchPodResourcesResponse` messages, enabling monitoring applications to be notified of new resource allocations, resource releases, and resource allocation updates, using the `action` field in the response.

This is shown in proto below:

```protobuf
// PodResources is a service provided by the kubelet that provides information about the
// node resources consumed by pods and containers on the node
service PodResources {
    rpc List(ListPodResourcesRequest) returns (ListPodResourcesResponse) {}
    rpc Watch(WatchPodResourcesRequest) returns (stream WatchPodResourcesResponse) {}
}

// ListPodResourcesRequest is the request made to the PodResources service
```

@@ -76,11 +82,27 @@

```protobuf
message ListPodResourcesResponse {
    repeated PodResources pod_resources = 1;
}

// WatchPodResourcesRequest is the request made to the Watch PodResourcesLister service
message WatchPodResourcesRequest {}

enum WatchPodAction {
    ADDED = 0;
    DELETED = 1;
}

// WatchPodResourcesResponse is the response returned by Watch function
message WatchPodResourcesResponse {
    WatchPodAction action = 1;
    string uid = 2;
    repeated PodResources pod_resources = 3;
}

// PodResources contains information about the node resources assigned to a pod
message PodResources {
    string name = 1;
    string namespace = 2;
    repeated ContainerResources containers = 3;
    int64 resource_version = 4;
}

// ContainerResources contains information about the resources assigned to a container
```

Review comment (on the `WatchPodResourcesResponse` fields): I wonder if just exposing the pod …

Review comment (on `resource_version`): is this the pod resource version as stored in etcd or something else?

Reply: yes, that's the intention, in order to enable client code to reconcile data from …

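As an illustration of how a monitoring agent might consume the `List` endpoint over the socket above, here is a minimal Go sketch. The stubs are assumed to be generated from the proto in this KEP into a hypothetical package imported as `podresourcesapi`; the import path, the `google.golang.org/grpc` plumbing, and the error handling are illustrative assumptions, not part of the proposal.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"

	"google.golang.org/grpc"

	// Hypothetical import path for Go stubs generated from the proto above.
	podresourcesapi "example.com/podresources/v1alpha1"
)

const socketPath = "/var/lib/kubelet/pod-resources/kubelet.sock"

func main() {
	// Dial the kubelet's pod-resources unix socket.
	conn, err := grpc.Dial(socketPath,
		grpc.WithInsecure(),
		grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
			return (&net.Dialer{}).DialContext(ctx, "unix", addr)
		}),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := podresourcesapi.NewPodResourcesClient(conn)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// List returns the current device assignment for every pod on the node.
	resp, err := client.List(ctx, &podresourcesapi.ListPodResourcesRequest{})
	if err != nil {
		panic(err)
	}
	for _, pod := range resp.PodResources {
		fmt.Printf("pod %s/%s: %d containers with assigned resources (resourceVersion=%d)\n",
			pod.Namespace, pod.Name, len(pod.Containers), pod.ResourceVersion)
	}
}
```
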
@@ -96,11 +118,34 @@ message ContainerDevices {

### Consuming the Watch endpoint in client applications

Using the `Watch` endpoint, client applications can be notified of pod resource allocation changes as soon as they happen.
However, the state of a pod is not sent until its first resource allocation change after the watch starts, which in the worst case is the pod's deletion.
Client applications that need the complete resource allocation picture therefore need to consume both the `List` and `Watch` endpoints.

The `resourceVersion` found in the responses of both APIs allows client applications to identify the most recent information.
The `resourceVersion` value is updated following the same semantics as the pod `resourceVersion` value, and the implementation
may reuse the value from the corresponding pod.
To keep the implementation as simple as possible, the kubelet does *not* store any historical list of changes.

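As a sketch of what "identify the most recent information" can look like on the client side, here is a hypothetical cache that keeps only the newest state per pod; the type and field names are illustrative, not part of the API.

```go
package podresourcecache

// entry is the client-side view of one pod's resource allocation.
type entry struct {
	resourceVersion int64
	deleted         bool
	// Device and container details would be stored here as well.
}

// Cache keeps the most recent state seen for each pod, keyed by the pod
// (for example namespace/name). The kubelet keeps no history, so the client
// is responsible for holding the latest known state itself.
type Cache struct {
	pods map[string]entry
}

func New() *Cache {
	return &Cache{pods: map[string]entry{}}
}

// Apply stores an update only if it is at least as new as what is already cached,
// so a stale Watch event (or an older List snapshot) cannot overwrite newer data.
func (c *Cache) Apply(key string, resourceVersion int64, deleted bool) bool {
	if cur, ok := c.pods[key]; ok && cur.resourceVersion > resourceVersion {
		return false // stale update, ignore it
	}
	c.pods[key] = entry{resourceVersion: resourceVersion, deleted: deleted}
	return true
}
```
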
In order to make sure not to miss any updates, client applications can:
1. call the `Watch` endpoint to get a stream of changes.
2. call the `List` endpoint to get the state of all the pods on the node.
3. reconcile updates using the `resourceVersion` (a sketch of this flow follows below).

Review comment (on step 1): what triggers watch events from getting emitted in kubelet code flows?

Reply: After some experiments, I think ADDED should be triggered after successful allocation from the topology manager (https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/topologymanager/topology_manager.go#L232).

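The sketch below ties steps 1-3 together, reusing the hypothetical `Cache` from the previous sketch and the same assumed generated stubs (`podresourcesapi`): the watch stream is opened first so nothing is missed, the `List` snapshot is applied, and then every event is reconciled through the same `resourceVersion` check.

```go
package podresourcecache

import (
	"context"

	// Hypothetical import path for Go stubs generated from the proto in this KEP.
	podresourcesapi "example.com/podresources/v1alpha1"
)

// watchThenList follows the sequence described above: open the Watch stream
// before calling List, so no change can slip between the two calls, then keep
// applying Watch events. Stale data is discarded by Cache.Apply, which compares
// resourceVersion values.
func watchThenList(ctx context.Context, client podresourcesapi.PodResourcesClient, cache *Cache) error {
	// Step 1: start watching before listing.
	stream, err := client.Watch(ctx, &podresourcesapi.WatchPodResourcesRequest{})
	if err != nil {
		return err
	}

	// Step 2: fetch the full state of all the pods on the node.
	list, err := client.List(ctx, &podresourcesapi.ListPodResourcesRequest{})
	if err != nil {
		return err
	}
	for _, pod := range list.PodResources {
		cache.Apply(pod.Namespace+"/"+pod.Name, pod.ResourceVersion, false)
	}

	// Step 3: reconcile the stream of changes using resourceVersion.
	for {
		ev, err := stream.Recv()
		if err != nil {
			return err
		}
		deleted := ev.Action == podresourcesapi.WatchPodAction_DELETED
		for _, pod := range ev.PodResources {
			cache.Apply(pod.Namespace+"/"+pod.Name, pod.ResourceVersion, deleted)
		}
	}
}
```

Opening the watch before listing means some events may arrive that are older than the `List` snapshot; the `resourceVersion` comparison in `Apply` is what makes replaying them harmless.
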
In order to make resource accounting on the client side as safe and easy as possible, the `Watch` implementation
will guarantee ordering of event delivery such that the capacity invariants are always preserved, and the accounted values
are consistent after each received event, not only at steady state.
Consider the following scenario with 10 devices, all allocated: pod A, with device D1 allocated, gets deleted, then
pod B starts and gets device D1 again. In this case `Watch` will guarantee that the `DELETED` and `ADDED` events are delivered
in the correct order.
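To illustrate why this ordering matters, here is a hypothetical per-device accounting sketch: because pod A's `DELETED` event is delivered before pod B's `ADDED` event, the "one owner per device" invariant holds after every single event, not just at the end.

```go
package deviceaccounting

import "fmt"

// owners maps a device ID to the pod currently using it ("" means free).
type owners map[string]string

// applyAdded records that pod now owns the given devices. Because the Watch
// stream delivers the DELETED event for the previous owner first, this never
// observes a device with two owners.
func (o owners) applyAdded(pod string, deviceIDs []string) error {
	for _, id := range deviceIDs {
		if cur, busy := o[id]; busy && cur != "" {
			return fmt.Errorf("invariant violated: device %s owned by both %s and %s", id, cur, pod)
		}
		o[id] = pod
	}
	return nil
}

// applyDeleted releases every device owned by the pod.
func (o owners) applyDeleted(pod string) {
	for id, cur := range o {
		if cur == pod {
			o[id] = ""
		}
	}
}
```

In the scenario above, `applyDeleted("podA")` frees D1 before `applyAdded("podB", []string{"D1"})` claims it, so the error path is never taken.
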
### Potential Future Improvements

* Add a `ListAndWatch()` function to the GRPC endpoint so monitoring agents don't need to poll.
* Add identifiers for other resources used by pods to the `PodResources` message.
  * For example, the persistent volume location on disk.
* Implement a historical list of changes, allowing client applications to call the `List` and `Watch` endpoints in a more natural order.

## Alternatives Considered

@@ -164,6 +209,7 @@ Beta:

## Implementation History

- 2020-10-01: KEP extended with Watch API
- 2018-09-11: Final version of KEP (proposing pod-resources endpoint) published and presented to sig-node. [Slides](https://docs.google.com/presentation/u/1/d/1xz-iHs8Ec6PqtZGzsmG1e68aLGCX576j_WRptd2114g/edit?usp=sharing)
- 2018-10-30: Demo with example gpu monitoring daemonset
- 2018-11-10: KEP lgtm'd and approved

Review comment: when is each action emitted? Can you clarify when `MODIFIED` would be used in the life of a pod?

Reply: Actions should be emitted:
I'll document this better in the KEP text.
In hindsight we most likely don't need `MODIFIED`, so we will just remove it.