Topology aware resource provisioning daemon #1870
Conversation
Hi @AlexeyPerevalov. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
reviewers:
  - "@dchen1107"
  - "@derekwaynecarr"
approvers:
please add @klueska
Sure, we'll add him.
could be a situation when TopologyManager's admit handler on the kubelet
side refuses pod launching since the pod's resource request doesn't satisfy the
selected TopologyManager policy; in this case the pod should be rescheduled.
And so it can get stuck in the rescheduling cycle.
this solution assumes the desired outcome is for a pod to remain pending until a viable node is found. another viable solution is for the scheduler to be aware of historical scheduling decisions of similar pods that reached an unexpected terminal state prior to running.
is this solution optimizing for scheduling time? how quickly do we think a scheduler can make a decision in this situation, especially where secondary affinity across devices is not obvious? for example, the hints that were added to the device plugin that let the kubelet understand a preferred set of co-located devices based on the device plugin itself.
/cc @klueska
AFAIK, if the pod is determined to be unschedulable or if there is an internal error, it is returned to the scheduling queue and retried. While the scheduling of the pod is being retried, it appears to the end user in a pending state. @ahg-g @Huang-Wei Can you please confirm the above? Also, regarding Derek's question, is there currently a way to be aware of historical decisions/failures of a pod?
Regarding optimizing the solution for scheduling time, this daemon creates CRDs based on resource information gathered from the CRI/PodResources API. #1884 aims to allow gathering device information along with NUMA ids from the PodResources API.
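For reference, a minimal sketch (not the daemon's actual code) of how such a daemon could query the kubelet's podresources socket. The socket path and the v1 `PodResourcesLister` API with per-device topology are assumptions based on where #1884 is heading:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"
	podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1"
)

// Assumed default kubelet podresources socket path.
const socketPath = "unix:///var/lib/kubelet/pod-resources/kubelet.sock"

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Dial the kubelet's podresources gRPC endpoint over the unix socket.
	conn, err := grpc.DialContext(ctx, socketPath, grpc.WithInsecure(), grpc.WithBlock())
	if err != nil {
		log.Fatalf("cannot connect to podresources socket: %v", err)
	}
	defer conn.Close()

	client := podresourcesapi.NewPodResourcesListerClient(conn)

	// List devices (and, in the v1 API, their NUMA topology hints)
	// currently allocated to running pods on this node.
	resp, err := client.List(ctx, &podresourcesapi.ListPodResourcesRequest{})
	if err != nil {
		log.Fatalf("List failed: %v", err)
	}

	for _, pod := range resp.GetPodResources() {
		for _, container := range pod.GetContainers() {
			for _, dev := range container.GetDevices() {
				fmt.Printf("pod=%s/%s container=%s resource=%s devices=%v topology=%v\n",
					pod.GetNamespace(), pod.GetName(), container.GetName(),
					dev.GetResourceName(), dev.GetDeviceIds(), dev.GetTopology())
			}
		}
	}
}
```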
please queue this up for discussion in sig-node at your convenience.
my reading of the kep is that if it proceeds, it would be a new sub-project. i wonder if node-feature-discovery could take on this use case so we avoid having more than one node agent in the project for similar use cases.
please discuss if node-feature-discovery is a useful home to add this feature from your perspective.
Resources used by the pod could be obtained by the podresources interface
of the kubelet and by CRI from the container runtime.

To calculate available resources we need to know all resources
i think there is a tension here with runtime classes.
also keep in mind that adoption of the kube-reserved versus system-reserved cgroups is low. most deployers still organize kube as part of the system-reserved budget.
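As a rough illustration of the bookkeeping this implies (a sketch only; the per-NUMA capacity, allocation, and reserved figures below are invented), per-NUMA availability would be capacity minus what running pods consume minus whatever share of the reserved budget lands on that node. How the kube-reserved/system-reserved budget is split across NUMA nodes is exactly the open question raised above:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// availableCPU computes a per-NUMA-node "available" figure as
// capacity - allocated - reserved.
func availableCPU(capacity, allocated, reserved resource.Quantity) resource.Quantity {
	avail := capacity.DeepCopy()
	avail.Sub(allocated)
	avail.Sub(reserved)
	return avail
}

func main() {
	// Hypothetical NUMA node 0 figures.
	capacity := resource.MustParse("32")  // CPUs on NUMA node 0
	allocated := resource.MustParse("12") // pinned to running guaranteed pods
	reserved := resource.MustParse("2")   // share of kube/system reserved attributed to this node

	fmt.Printf("numa0 available cpu: %s\n", availableCPU(capacity, allocated, reserved).String())
}
```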
Containerd has been used as the container runtime in the initial investigation. The internal container object info is available
[here](https://github.com/containerd/cri/blob/master/pkg/server/container_status.go#L130)

The Daemon set is responsible for the following:
this only works if the container is running, right?
if the container is in crashloop backoff, or is unable to start right away (let's say the volume has not yet mounted), by introspecting the CRI state you get an incomplete view. have you thought about whether the kubelet itself should expose an endpoint?
We've considered two ways of obtaining topology information: CRI and podresources. Both interfaces would have to be extended to do it.
Your assessment here is correct; we check for resources allocated to running pods. We haven't explored the kubelet exposing an endpoint yet.
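For reference, a sketch of the CRI-based path discussed here (illustrative only; it assumes the containerd CRI socket path and that the verbose status info carries the cgroup/cpuset details the daemon would need). It also shows the limitation raised above: only running containers are visible.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1alpha2"
)

const criSocket = "unix:///run/containerd/containerd.sock" // assumed containerd CRI endpoint

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, criSocket, grpc.WithInsecure(), grpc.WithBlock())
	if err != nil {
		log.Fatalf("cannot connect to CRI endpoint: %v", err)
	}
	defer conn.Close()

	client := runtimeapi.NewRuntimeServiceClient(conn)

	// Only running containers show up here; crash-looping or not-yet-started
	// containers give an incomplete view, as pointed out in the review.
	containers, err := client.ListContainers(ctx, &runtimeapi.ListContainersRequest{})
	if err != nil {
		log.Fatalf("ListContainers failed: %v", err)
	}

	for _, c := range containers.GetContainers() {
		// Verbose=true returns the runtime's internal info map (for containerd,
		// a JSON blob that includes the resolved runtime spec with cpuset data).
		status, err := client.ContainerStatus(ctx, &runtimeapi.ContainerStatusRequest{
			ContainerId: c.GetId(),
			Verbose:     true,
		})
		if err != nil {
			log.Printf("status for %s failed: %v", c.GetId(), err)
			continue
		}
		fmt.Printf("container %s: %d verbose info entries\n", c.GetId(), len(status.GetInfo()))
	}
}
```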
### Design based on podresources interface of the kubelet

The podresources interface of the kubelet is described in
i think @dashpole should review this carefully.
yes, we already started a KEP (#1884). It covers our requirements (except maybe memory).
The code for working with it is generated by https://github.com/kubernetes/code-generator.git
One CRD instance contains information about the available resources of the corresponding worker node.
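For context, the Go type behind such a CRD might look roughly like this (a hedged sketch: `TopologyPolicy` comes from the quoted KEP text, while the `Zones`/`ResourceInfo` fields are illustrative, not the exact schema being proposed):

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// +genclient
// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object

// NodeResourceTopology describes the NUMA topology and available
// resources of a single worker node; one instance per node.
type NodeResourceTopology struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// TopologyPolicy mirrors the kubelet TopologyManager policy on this node.
	TopologyPolicy string `json:"topologyPolicy"`

	// Zones is an illustrative field: per-NUMA-zone resource availability.
	Zones []Zone `json:"zones,omitempty"`
}

// Zone is a hypothetical per-NUMA-node resource accounting entry.
type Zone struct {
	Name      string         `json:"name"`
	Resources []ResourceInfo `json:"resources,omitempty"`
}

// ResourceInfo is a hypothetical capacity/allocatable pair for one resource.
type ResourceInfo struct {
	Name        string `json:"name"`
	Capacity    string `json:"capacity"`
	Allocatable string `json:"allocatable"`
}

// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object

// NodeResourceTopologyList is the list type required by code-generator clients.
type NodeResourceTopologyList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata,omitempty"`
	Items           []NodeResourceTopology `json:"items"`
}
```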
is there a single global identity presenting this information or do you anticipate something similar to how NodeRestriction works which limits the capability of the node to report only on things it knows about (and not other nodes).
Are you afraid that one worker node could modify the CRD of another?
yes.
metadata:
  name: noderesourcetopology-handler
rules:
  - apiGroups: ["k8s.cncf.io"]
are you advocating that the sub-project that would build this daemon would be sponsored by sig-node, or is there some work happening elsewhere in cncf that this API group is referencing? if the desire is to sponsor this as a sub-project of sig-node, this would go under x-k8s.io.
no, this work wasn't done under the CNCF, so "node.k8s.io" or "discovery.k8s.io" would be a more proper apiGroup.
"topology.node.kubernetes.io" could be our other viable option
  name: noderesourcetopology-account

`serviceAccountName: noderesourcetopology-account` would have to be added to the manifest file of the Daemon.
ideally we could get to a per-node daemonset identity in the future.
Sure, made note of an action item to address this in the future.
|
||
* The feature has been stable and reliable in the past several releases. | ||
* Documentation should exist for the feature. | ||
* Test coverage of the feature is acceptable. |
what is the recommended test strategy?
what resources are needed in the project to sustainably maintain the signals they produce?
The same hardware requirements as for TopologyManager, but here it is probably better to have at least 2 nodes with a bare-metal installation or VMs with NUMA topology. I'll describe it in detail here.
@derekwaynecarr even if it is possible to land the CRD and its supporting code (informer/lister/clientset) and the CRD creation code (the whole exporting logic) in the node-feature-discovery repository, we would still need the CRD informer/clientset in the kubernetes repository for the kube-scheduler plugin, since node-feature-discovery is not in the staging area of the kubernetes project.
metav1.TypeMeta   `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`

TopologyPolicy string
How is this daemon going to obtain the node's topology policy?
This daemon should have access to the KubeletConfiguration. If it is in the file system, the daemon should be able to read it. I think Dynamic Kubelet Configuration will also be supported.
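A minimal sketch of that file-based option (assuming the default /var/lib/kubelet/config.yaml location is mounted into the daemon pod, and decoding only the fields the daemon would care about):

```go
package main

import (
	"fmt"
	"log"
	"os"

	"sigs.k8s.io/yaml"
)

// kubeletConfig decodes only the relevant fields from the on-disk
// KubeletConfiguration; everything else is ignored.
type kubeletConfig struct {
	TopologyManagerPolicy string `json:"topologyManagerPolicy"`
	ReservedSystemCPUs    string `json:"reservedSystemCPUs"`
}

func main() {
	// Assumed default location; a deployment would mount this path into the daemon pod.
	data, err := os.ReadFile("/var/lib/kubelet/config.yaml")
	if err != nil {
		log.Fatalf("cannot read kubelet config: %v", err)
	}

	var cfg kubeletConfig
	if err := yaml.Unmarshal(data, &cfg); err != nil {
		log.Fatalf("cannot parse kubelet config: %v", err)
	}

	fmt.Printf("topology policy: %q, reserved CPUs: %q\n",
		cfg.TopologyManagerPolicy, cfg.ReservedSystemCPUs)
}
```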
What about flags?
Handling of flags could be useful, especially for --reserved-cpus, which determines which CPUs are reserved for the kubelet and which can be used for the workload. Workloads that need pinning could ask the scheduler for, let's say, 3 isolated/pinnable/dedicated CPUs, which are needed e.g. for things related to low latency or high throughput.
In the case of a command line option, nfd-worker would need access to procfs. That is not so graceful (it requires more permissions). Maybe in this case the planned GetAvailableDevices could be extended to GetAvailableResources and provide a list of CPU ids; that would help avoid dealing with --reserved-cpus and the system & kube cgroups.
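To make the reserved-CPU arithmetic concrete, a small sketch (the CPU ranges are invented, and it assumes the cpuset helpers now published as k8s.io/utils/cpuset) of deriving the workload-usable set by subtracting --reserved-cpus from the online CPUs:

```go
package main

import (
	"fmt"
	"log"

	"k8s.io/utils/cpuset"
)

func main() {
	// Hypothetical values: all online CPUs on the node and the kubelet's
	// --reserved-cpus / reservedSystemCPUs setting.
	online, err := cpuset.Parse("0-15")
	if err != nil {
		log.Fatal(err)
	}
	reserved, err := cpuset.Parse("0,8")
	if err != nil {
		log.Fatal(err)
	}

	// CPUs that remain available for exclusive (pinned) workload use.
	allocatable := online.Difference(reserved)
	fmt.Printf("allocatable CPUs: %s (count %d)\n", allocatable.String(), allocatable.Size())
}
```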
Currently it provides information only from the DeviceManager, not from the
CPUManager, and no information about memory usage.
But the information provided by podresources could easily be extended
I personally prefer representing CPUs separately in the API instead of making them a specially named resource. The latter would make it more difficult to understand IMO.
yes, we need to update this KEP according to our podresources KEP. We were just waiting for a response from the Node Feature Discovery community.
Currently it provides information only from the DeviceManager, not from the
CPUManager, and no information about memory usage.
But the information provided by podresources could easily be extended
without modification of the API.
I know you included it in the other KEP, but it would be good to paste the proposed API change here as well for future reference.
agree
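For future reference, a rough Go-level sketch of the kind of podresources extension being discussed in #1884; the type and field names here are illustrative, not the final proto:

```go
package podresourcessketch

// These types roughly mirror the shape of the podresources extension
// discussed in kubernetes#1884; names are illustrative only.

// TopologyInfo carries the NUMA nodes a device or CPU is attached to.
type TopologyInfo struct {
	Nodes []NUMANode
}

// NUMANode identifies a single NUMA node by ID.
type NUMANode struct {
	ID int64
}

// ContainerDevices lists devices of one resource type assigned to a container.
type ContainerDevices struct {
	ResourceName string
	DeviceIDs    []string
	Topology     *TopologyInfo
}

// ContainerResources extends the per-container report with CPU IDs and
// (possibly) memory, which is the gap called out in the quoted KEP text.
type ContainerResources struct {
	Name    string
	Devices []ContainerDevices
	CPUIDs  []int64 // exclusively allocated CPUs; NUMA affinity derivable from sysfs
}
```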
Going forward we may think of an interface where we can also plug in other schedulers that may have even more restrictions and need a finer granularity. If we are using NFD for exposing node features, we may think about how to expose even more peculiar features in the cluster. In the EDGE use-case there was some work on using "devicetree" to add more features to NFD. In the HPC use-case we have "build-tools" that need far more than just the cpu-flags (model, architecture, ...); this was also needed for kubeVirt. There are also other performance features to consider.

The other angle is that all the tools/operators we already have running on the cluster may also want to expose specific features, which means we need a common interface in NFD for other entities to publish their "features" (currently a side-car container). Let's take e.g. NTO, which can isolate cores. We have three main parts in the chain that we need to consider:
Another source could also be the kubelet settings, especially e.g. --reserved-cpus, exposing which CPUs are reserved for the kubelet and which can be used for the workload. Workloads that need pinning could ask the scheduler for, let's say, 3 isolated/pinnable CPUs, which are needed e.g. for things related to low latency or high throughput.
For the first iteration we were planning to provide the information described in this document: https://docs.google.com/document/d/12kj3fK8boNuPNqob6F_pPU9ZTaNEnPGaXEooW1Cilwg/edit#
Thanks for sharing this idea! Our use case (topology aware scheduler) would be a perfect candidate for testing out this proposed capability when it's ready! Regarding this proposed idea of plugging in schedulers I have a few questions:
I think this proposal should consider supporting scheduling profiles too.
Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>
- Required to access NodeResourceTopology CRD instances - Update minor formatting issues Signed-off-by: Swati Sehgal <swsehgal@redhat.com> Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>
…discussions - Rephrased the Summary section. - Captured flexible/generic CRD specification (details provided here: https://docs.google.com/document/d/12kj3fK8boNuPNqob6F_pPU9ZTaNEnPGaXEooW1Cilwg/edit). - Moved the design based on CRI to the alternatives section because of its drawbacks. Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Swati Sehgal <swsehgal@redhat.com> Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>
/ok-to-test
### NUMA specification in ResourceName

The representation of a resource consists of two parts, subdomain/resourceName, where the
subdomain can be omitted. The subdomain contains the vendor name. It does not suit well for
reflecting NUMA information of the node, nor does the / delimiter, since the subdomain is optional.
So a new delimiter should be introduced to separate the NUMA part from subdomain/resourceName.

It might look like:
numa%d///subdomain/resourceName

%d - the number of the NUMA node
/// - delimiter
numa%d/// - can be omitted

This approach may have side effects.
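Purely for illustration, a parser for the proposed numa%d/// prefix scheme quoted above (the format comes straight from the quoted text; the vendor resource name in the example is hypothetical):

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// numaPrefix matches the proposed optional "numa<N>///" prefix in front of
// the usual [subdomain/]resourceName form.
var numaPrefix = regexp.MustCompile(`^numa(\d+)///(.+)$`)

// splitNUMAResource returns the NUMA node number (or -1 if the prefix is
// omitted) and the remaining subdomain/resourceName part.
func splitNUMAResource(name string) (numaNode int, resource string) {
	if m := numaPrefix.FindStringSubmatch(name); m != nil {
		n, err := strconv.Atoi(m[1])
		if err == nil {
			return n, m[2]
		}
	}
	return -1, name
}

func main() {
	for _, name := range []string{
		"numa0///vendor.example.com/gpu", // hypothetical vendor resource pinned to NUMA 0
		"cpu",                            // prefix omitted
	} {
		node, res := splitNUMAResource(name)
		fmt.Printf("%-35s -> numa=%d resource=%s\n", name, node, res)
	}
}
```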
It might not be enough to handle modern hardware. In the case of AMD, the topology depends on the configuration (https://developer.amd.com/wp-content/resources/56338_1.00_pub.pdf
https://developer.amd.com/wp-content/resources/56745_0.80.pdf), and Intel supports sub-NUMA clustering.
- This KEP consolidates the following two KEPs into one - kubernetes#1858 - kubernetes#1870 - Also the KEP talks about introducing NodeResourceTopology as a native Kubernetes resource. Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
- This KEP consolidates the following two KEPs into one - kubernetes#1858 - kubernetes#1870 - Also the KEP talks about introducing NodeResourceTopology as a native Kubernetes resource. Co-authored-by: Alexey Perevalov <alexey.perevalov@huawei.com> Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
@AlexeyPerevalov: The following tests failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close
@k8s-triage-robot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This PR describes a daemon which is responsible for provisioning resources with topology information.
These resources will be used by a kube-scheduler plugin responsible for topology-aware scheduling,
described in KEP #1858
Issue: kubernetes/kubernetes#84869
Ref: #2051