Topology aware resource provisioning daemon #1870
Closed
AlexeyPerevalov wants to merge 5 commits into kubernetes:master from AlexeyPerevalov:provisioning-resources-with-numa-topology
- 2584202: Provisioning the Node Resources With NUMA Topology Information (AlexeyPerevalov)
- cc9c3a1: CRI update (swatisehgal)
- eb19c1f: NodeTopologyResource ClusterRole and ClusterRoleBinding (swatisehgal)
- 5997089: Polish the KEP to capture recent changes after reviews and community … (swatisehgal)
- 6a8d773: Capturing Test plan in the KEP (swatisehgal)
278 changes: 278 additions & 0 deletions
keps/sig-node/2051-provisioning-resources-with-numa-topology/README.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,278 @@ | ||
# Exposing Node Resources With NUMA Topology Information | ||
|
||
## Table of Contents | ||
|
||
<!-- toc --> | ||
- [Summary](#summary) | ||
- [Motivation](#motivation) | ||
  - [Goals](#goals) | ||
  - [Non-Goals](#non-goals) | ||
- [Proposal](#proposal) | ||
- [Design Details](#design-details) | ||
  - [Design based on podresources interface of the kubelet](#design-based-on-podresources-interface-of-the-kubelet) | ||
  - [API](#api) | ||
  - [Integration into Node Feature Discovery](#integration-into-node-feature-discovery) | ||
  - [Graduation Criteria](#graduation-criteria) | ||
  - [Test Plan](#test-plan) | ||
- [Implementation History](#implementation-history) | ||
- [Alternatives](#alternatives) | ||
  - [Annotation approach](#annotation-approach) | ||
  - [NUMA specification in ResourceName](#numa-specification-in-resourcename) | ||
  - [Design based on CRI](#design-based-on-cri) | ||
    - [Drawbacks](#drawbacks) | ||
<!-- /toc --> | ||
|
||
## Summary | ||
|
||
Kubernetes clusters composed of nodes with complex hardware topology are becoming more prevalent. | ||
[Topology Manager](https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/) was | ||
introduced in Kubernetes as part of the kubelet in order to extract the best performance out of | ||
these high-performance hybrid systems. It performs optimizations related to resource allocation | ||
in order to make it more likely for a given pod to perform optimally. In scenarios where | ||
Topology Manager is unable to align the topology of the requested resources based on the selected | ||
Topology Manager policy, the pod is rejected with a Topology Affinity Error. | ||
[This](https://github.com/kubernetes/kubernetes/issues/84869) Kubernetes issue provides | ||
further context on how runaway pods are created because the scheduler is topology-unaware. | ||
|
||
In order to address this issue, the scheduler needs to choose a node considering resource availability along with the underlying resource topology and the Topology Manager policy on the worker node. | ||
|
||
## Motivation | ||
|
||
In order to enable topology aware scheduling, resource topology information of the nodes in the cluster | ||
needs to be exposed. This KEP describes how it would be implemented. | ||
|
||
### Goals | ||
|
||
Provisioning resources with topology information. | ||
|
||
### Non-Goals | ||
|
||
- modification of any public API | ||
- improvement, and the resulting modification, of the Topology Manager and its policies | ||
|
||
## Proposal | ||
|
||
Add the ability to expose pod resource information with NUMA topology to the Node Feature | ||
Discovery [daemon](https://github.com/kubernetes-sigs/node-feature-discovery). | ||
|
||
## Design Details | ||
|
||
The design consists of two parts: how the data is collected and how it is provided. | ||
|
||
Resources used by the pod can be obtained through the [podresources](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/compute-device-assignment.md) interface of the kubelet. | ||
|
||
To calculate the available resources, the daemon needs to know all | ||
resources that Kubernetes could use. This can be calculated by | ||
subtracting the resources of the kube cgroup and the system cgroup from | ||
the system allocatable resources. | ||
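
The subtraction described above can be sketched as follows. This is a minimal illustration rather than the daemon's actual code: the `ResourceList` type and the `usableByKubernetes` function are hypothetical names introduced here, and quantities are assumed to be plain integers in each resource's base unit (for example millicores for CPU).

```go
package main

// ResourceList maps a resource name to a quantity in that resource's base
// unit. It is a hypothetical type for this sketch, not the Kubernetes API type.
type ResourceList map[string]int64

// usableByKubernetes subtracts the kube cgroup and system cgroup reservations
// from the system resources. Results are clamped at zero so that an
// over-reservation never produces a negative quantity.
func usableByKubernetes(system, kubeReserved, systemReserved ResourceList) ResourceList {
	usable := make(ResourceList, len(system))
	for name, total := range system {
		// Looking up a missing key yields zero, so resources without a
		// reservation are passed through unchanged.
		left := total - kubeReserved[name] - systemReserved[name]
		if left < 0 {
			left = 0
		}
		usable[name] = left
	}
	return usable
}
```

For example, with 16000 millicores of CPU on the system and 500 millicores reserved for each of the two cgroups, the function reports 15000 millicores as usable.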
|
||
|
||
### Design based on podresources interface of the kubelet | ||
|
||
The podresources interface of the kubelet is described in | ||
|
||
[pkg/kubelet/apis/podresources/v1alpha1/api.proto](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/podresources/v1alpha1/api.proto) | ||
|
||
It is available to every process on the worker node through a | ||
Unix domain socket located at the following path: | ||
|
||
```go | ||
filepath.Join(kl.getRootDir(), config.DefaultKubeletPodResourcesDirName) | ||
``` | ||
|
||
It can be used to collect the resources in use on the worker node and to evaluate | ||
their NUMA assignment (by device ID). | ||
|
||
Podresources could also be used to obtain initial information on resources of the worker node. | ||
|
||
The PodResource API as it stands today: | ||
* only provides information from Device Manager but not from CPU Manager. | ||
* doesn't contain topology information as part of ContainerDevice. | ||
* doesn't have the capability to let clients enumerate the resources. | ||
|
||
This [KEP](https://github.com/kubernetes/enhancements/pull/1884) proposes an extension of the podresources API to address the gaps mentioned above. | ||
|
||
With the changes proposed in the above KEP, this interface might look as follows: | ||
|
||
```proto | ||
syntax = "proto3"; | ||
|
||
package v1alpha1; | ||
|
||
// The PodResources and Topology messages are not shown in this excerpt. | ||
service PodResourcesLister { | ||
    rpc List(ListPodResourcesRequest) returns (ListPodResourcesResponse) {} | ||
    rpc GetAvailableResources(AvailableResourcesRequest) returns (AvailableResourcesResponse) {} | ||
} | ||
|
||
message ListPodResourcesRequest {} | ||
|
||
message ListPodResourcesResponse { | ||
    repeated PodResources pod_resources = 1; | ||
} | ||
|
||
message AvailableResourcesRequest {} | ||
|
||
message AvailableResourcesResponse { | ||
    repeated ContainerDevices devices = 1; | ||
    repeated int64 cpu_ids = 2; | ||
} | ||
|
||
message ContainerDevices { | ||
    string resource_name = 1; | ||
    repeated string device_ids = 2; | ||
    Topology topology = 3; | ||
} | ||
``` | ||
|
||
### API | ||
|
||
The available resources of the node, together with their topology, should be stored in a CRD. The format of the topology is described | ||
[in this document](https://docs.google.com/document/d/12kj3fK8boNuPNqob6F_pPU9ZTaNEnPGaXEooW1Cilwg/edit). | ||
|
||
|
||
```go | ||
// metav1 refers to k8s.io/apimachinery/pkg/apis/meta/v1. | ||
|
||
// NodeResourceTopology describes the resource topology of a worker node. | ||
type NodeResourceTopology struct { | ||
	metav1.TypeMeta   `json:",inline"` | ||
	metav1.ObjectMeta `json:"metadata,omitempty"` | ||
|
||
	TopologyPolicies []string `json:"topologyPolicies"` | ||
	Zones            ZoneMap  `json:"zones"` | ||
} | ||
|
||
// Zone describes one topology zone of a NodeResourceTopology and its resources. | ||
type Zone struct { | ||
	Type       string          `json:"type"` | ||
	Parent     string          `json:"parent,omitempty"` | ||
	Costs      map[string]int  `json:"costs,omitempty"` | ||
	Attributes map[string]int  `json:"attributes,omitempty"` | ||
	Resources  ResourceInfoMap `json:"resources,omitempty"` | ||
} | ||
|
||
// ResourceInfo holds the allocatable amount and the capacity of one resource. | ||
type ResourceInfo struct { | ||
	Allocatable string `json:"allocatable"` | ||
	Capacity    string `json:"capacity"` | ||
} | ||
|
||
// ZoneMap is keyed by zone name; ResourceInfoMap is keyed by resource name. | ||
type ZoneMap map[string]Zone | ||
type ResourceInfoMap map[string]ResourceInfo | ||
``` | ||
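
To illustrate how these types might be populated, here is a small sketch that builds an object for a node with two NUMA zones and serializes it to JSON. Several things here are assumptions made for illustration only: the zone names (`node-0`, `node-1`), the zone type string `Node`, the cost values, and the resource quantities. The `nodeTopology` struct is a trimmed stand-in for `NodeResourceTopology` without the `metav1` fields, so that the example needs only the standard library.

```go
package main

import "encoding/json"

// ResourceInfo, ResourceInfoMap, Zone and ZoneMap repeat the KEP's types.
type ResourceInfo struct {
	Allocatable string `json:"allocatable"`
	Capacity    string `json:"capacity"`
}

type ResourceInfoMap map[string]ResourceInfo

type Zone struct {
	Type       string          `json:"type"`
	Parent     string          `json:"parent,omitempty"`
	Costs      map[string]int  `json:"costs,omitempty"`
	Attributes map[string]int  `json:"attributes,omitempty"`
	Resources  ResourceInfoMap `json:"resources,omitempty"`
}

type ZoneMap map[string]Zone

// nodeTopology is a hypothetical, trimmed stand-in for NodeResourceTopology
// that omits the metav1.TypeMeta and metav1.ObjectMeta fields.
type nodeTopology struct {
	TopologyPolicies []string `json:"topologyPolicies"`
	Zones            ZoneMap  `json:"zones"`
}

// exampleTopology describes an imaginary node with two NUMA zones.
// All names and numbers are made up for illustration.
func exampleTopology() nodeTopology {
	return nodeTopology{
		TopologyPolicies: []string{"single-numa-node"},
		Zones: ZoneMap{
			"node-0": {
				Type:      "Node",
				Costs:     map[string]int{"node-0": 10, "node-1": 20},
				Resources: ResourceInfoMap{"cpu": {Allocatable: "6", Capacity: "8"}},
			},
			"node-1": {
				Type:      "Node",
				Costs:     map[string]int{"node-0": 20, "node-1": 10},
				Resources: ResourceInfoMap{"cpu": {Allocatable: "8", Capacity: "8"}},
			},
		},
	}
}

// encode renders the object as compact JSON. Map keys are emitted in sorted
// order by encoding/json, and empty omitempty fields are left out.
func encode(t nodeTopology) (string, error) {
	b, err := json.Marshal(t)
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```

Calling `encode(exampleTopology())` yields a JSON document with one entry per zone under `zones`, which is the shape a consumer of the CRD would need to walk to find the per-zone allocatable amounts.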
|
||
The code for working with it is generated by [code-generator](https://github.com/kubernetes/code-generator.git). | ||
Each CRD instance contains information about the available resources of the corresponding worker node. | ||
|
||
|
||
### Integration into Node Feature Discovery | ||
|
||
In order to allow the NFD-master Daemon to create, get, update, and delete NodeResourceTopology CRD instances, a ClusterRole and ClusterRoleBinding would have to be configured as below: | ||
|
||
``` yaml | ||
apiVersion: rbac.authorization.k8s.io/v1 | ||
kind: ClusterRole | ||
metadata: | ||
  name: noderesourcetopology-handler | ||
rules: | ||
- apiGroups: ["topology.node.k8s.io"] | ||
  resources: ["noderesourcetopologies"] | ||
  verbs: ["*"] | ||
- apiGroups: ["rbac.authorization.k8s.io"] | ||
  resources: ["*"] | ||
  verbs: ["*"] | ||
--- | ||
apiVersion: rbac.authorization.k8s.io/v1 | ||
kind: ClusterRoleBinding | ||
metadata: | ||
  name: handle-noderesourcetopology | ||
subjects: | ||
- kind: ServiceAccount | ||
  name: noderesourcetopology-account | ||
  namespace: default | ||
roleRef: | ||
  kind: ClusterRole | ||
  name: noderesourcetopology-handler | ||
  apiGroup: rbac.authorization.k8s.io | ||
--- | ||
apiVersion: v1 | ||
kind: ServiceAccount | ||
metadata: | ||
  name: noderesourcetopology-account | ||
``` | ||
|
||
`serviceAccountName: noderesourcetopology-account` would have to be added to the manifest file of the Daemon. | ||
|
||
### Graduation Criteria | ||
|
||
* The feature has been pushed to Node Feature Discovery. | ||
* The feature has been stable and reliable in the past several releases. | ||
* Documentation should exist for the feature. | ||
* Test coverage of the feature is acceptable. | ||
|
||
### Test Plan | ||
|
||
* Unit test coverage. | ||
* E2E tests would be added to Node Feature Discovery repository. | ||
|
||
## Implementation History | ||
|
||
- 2020-06-22: Initial KEP published. | ||
- 2020-09-16: Updated to capture flexible/generic CRD specification. Moved the design based on CRI to the Alternatives section because of its drawbacks. | ||
- 2020-09-29: Capturing the test plan for this feature. | ||
|
||
## Alternatives | ||
|
||
The provisioning of the resources could also be implemented in other ways. | ||
The daemon can keep resources in a node annotation or in the pod's annotation. | ||
Alternatively, the kubelet can provide additional resources with NUMA information in ResourceName. | ||
|
||
### Annotation approach | ||
|
||
Annotations on the node or pod are another place to store arbitrary information. | ||
|
||
This approach doesn't have known side effects. | ||
|
||
|
||
### NUMA specification in ResourceName | ||
|
||
The representation of a resource consists of two parts, subdomain/resourceName, where | ||
the subdomain could be omitted. The subdomain contains the vendor name. It is not well suited to | ||
reflecting NUMA information of the node, and neither is the / delimiter, since the subdomain is optional. | ||
So a new delimiter should be introduced to separate the NUMA information from subdomain/resourceName. | ||
|
||
It might look like: | ||
numa%d///subdomain/resourceName | ||
|
||
%d - NUMA node number | ||
/// - delimiter | ||
numa%d/// - could be omitted | ||
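
As a rough illustration of how a consumer might read such names, the sketch below splits a string of the form `numa%d///subdomain/resourceName` into the NUMA node number and the plain resource name. The function name `parseResourceName`, the use of `-1` for an omitted prefix, and the sample resource `example.com/gpu` are assumptions made for this example; the KEP itself does not define a parser.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// numaDelimiter separates the optional NUMA prefix from subdomain/resourceName.
const numaDelimiter = "///"

// parseResourceName returns the NUMA node number and the remaining
// subdomain/resourceName part. When the "numa%d///" prefix is omitted,
// the NUMA node is reported as -1 and the input is returned unchanged.
func parseResourceName(s string) (int, string, error) {
	idx := strings.Index(s, numaDelimiter)
	if idx < 0 {
		return -1, s, nil
	}
	prefix, rest := s[:idx], s[idx+len(numaDelimiter):]
	if !strings.HasPrefix(prefix, "numa") {
		return 0, "", fmt.Errorf("unexpected prefix %q", prefix)
	}
	node, err := strconv.Atoi(strings.TrimPrefix(prefix, "numa"))
	if err != nil || node < 0 {
		return 0, "", fmt.Errorf("invalid NUMA node in %q", prefix)
	}
	if rest == "" {
		return 0, "", fmt.Errorf("missing resource name in %q", s)
	}
	return node, rest, nil
}
```

With this sketch, `numa1///example.com/gpu` is read as NUMA node 1 with resource `example.com/gpu`, while a plain `cpu` is passed through with no NUMA node attached.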
|
||
This approach may have side effects. | ||
|
||
### Design based on CRI | ||
|
||
The ContainerStatusResponse returned in response to the ContainerStatus RPC contains an `info` field, which the container runtime uses for capturing ContainerInfo. | ||
```proto | ||
message ContainerStatusResponse { | ||
    ContainerStatus status = 1; | ||
    map<string, string> info = 2; | ||
} | ||
``` | ||
|
||
Containerd was used as the container runtime in the initial investigation. The internal container object info can be found | ||
[here](https://github.com/containerd/cri/blob/master/pkg/server/container_status.go#L130). | ||
|
||
The DaemonSet is responsible for the following: | ||
|
||
- Parsing the info field to obtain container resource information | ||
- Identifying NUMA nodes of the allocated resources | ||
- Identifying total number of resources allocated on a NUMA node basis | ||
- Detecting Node resource capacity on a NUMA node basis | ||
- Updating the per-node CRD instance to indicate the resources available on each NUMA node, which the scheduler then consults | ||
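
The bookkeeping behind the last three responsibilities can be sketched as below. This is only an illustration under simplifying assumptions: parsing of the `info` field is left out entirely, quantities are plain integers, and the `allocation` and `perNUMA` types and both function names are hypothetical.

```go
package main

// allocation records that a container holds some amount of a resource on a
// given NUMA node. It is a hypothetical type for this sketch.
type allocation struct {
	NUMANode int
	Resource string
	Amount   int64
}

// perNUMA maps a NUMA node number to per-resource quantities.
type perNUMA map[int]map[string]int64

// sumAllocated totals the allocated amount of each resource per NUMA node.
func sumAllocated(allocs []allocation) perNUMA {
	out := perNUMA{}
	for _, a := range allocs {
		if out[a.NUMANode] == nil {
			out[a.NUMANode] = map[string]int64{}
		}
		out[a.NUMANode][a.Resource] += a.Amount
	}
	return out
}

// available subtracts the allocated totals from the capacity of each NUMA
// node. NUMA nodes or resources without any allocation keep full capacity,
// because a lookup of a missing key yields zero.
func available(capacity, allocated perNUMA) perNUMA {
	out := perNUMA{}
	for node, resources := range capacity {
		out[node] = map[string]int64{}
		for name, total := range resources {
			out[node][name] = total - allocated[node][name]
		}
	}
	return out
}
```

The result of `available` is the per-NUMA view that would be written into the CRD instance of the node.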
|
||
|
||
#### Drawbacks | ||
|
||
The content of the `info` field is free-form and unregulated by the API contract, so CRI-compliant container runtime engines are not required to add any configuration-specific information, such as CPU allocation, there. In the case of the containerd container runtime, the Linux Container Configuration is added to the `info` map depending on the verbosity setting of the container runtime engine. | ||
|
||
Work is currently under way in the community, as part of the Vertical Pod Autoscaling feature, to update the ContainerStatus field to report back containerResources | ||
[KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/20191025-kubelet-container-resources-cri-api-changes.md). |
22 changes: 22 additions & 0 deletions
keps/sig-node/2051-provisioning-resources-with-numa-topology/kep.yaml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
--- | ||
title: Provisioning the Node Resources With NUMA Topology Information | ||
authors: | ||
- "@AlexeyPerevalov" | ||
- "@swatisehgal" | ||
owning-sig: sig-node | ||
participating-sigs: | ||
- sig-node | ||
- sig-scheduling | ||
reviewers: | ||
- "@dchen1107" | ||
- "@derekwaynecarr" | ||
- "@klueska" | ||
approvers: | ||
- "@dchen1107" | ||
- "@derekwaynecarr" | ||
creation-date: 2020-06-19 | ||
last-updated: 2020-08-12 | ||
status: implementable | ||
see-also: | ||
- "/keps/sig-scheduling/2044-simplified-topology-manager/README.md" | ||
--- |
This might not be enough to handle modern hardware. In the case of AMD, the topology depends on configuration (https://developer.amd.com/wp-content/resources/56338_1.00_pub.pdf
https://developer.amd.com/wp-content/resources/56745_0.80.pdf), and Intel supports sub-NUMA clustering.