Topology awareness in Kube-scheduler #2787
Conversation
/cc @alculquicondor
The kube-scheduler plugin will be moved from the out-of-tree kubernetes-sigs/scheduler-plugins repository into the main tree as a built-in plugin. This plugin implements a simplified version of Topology Manager and is therefore different from the original Topology Manager algorithm. The plugin would be disabled by default; when enabled, it would check the ability to run a pod only for the single-numa-node policy on the node. Since that is the strictest policy, a pod that passes the single-numa-node check for a worker node would also launch successfully under any of the other existing policies.
what if the node is not under this policy? It sounds like it could lead to underutilization.
It is a sys-admin decision to configure kubelet with a Topology Manager policy. Not all workloads require strict NUMA alignment, and therefore the Topology Manager policy could be configured to be none or restricted. More information on this is here. The default policy is none, where Topology Manager does not perform any topology alignment. But for a workload that requires NUMA alignment, running on a node where the single-numa-node policy is not configured would lead to underperformance of the workload.
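For concreteness, a minimal sketch of where this policy is configured, assuming the `k8s.io/kubelet/config/v1beta1` types; this is illustrative only and not part of this KEP's proposed changes.

```go
package example

import kubeletconfig "k8s.io/kubelet/config/v1beta1"

// topologyAwareKubeletConfig sketches the kubelet configuration knobs that
// control Topology Manager behaviour. Cluster operators normally set these
// in the kubelet config file rather than in code.
func topologyAwareKubeletConfig() *kubeletconfig.KubeletConfiguration {
	cfg := &kubeletconfig.KubeletConfiguration{}
	// Topology Manager is gated by a feature gate in older releases.
	cfg.FeatureGates = map[string]bool{"TopologyManager": true}
	// Valid policies: "none" (default), "best-effort", "restricted",
	// "single-numa-node" (the strictest, assumed by the proposed plugin).
	cfg.TopologyManagerPolicy = "single-numa-node"
	return cfg
}
```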
- Make scheduling process more precise when we have NUMA topology on the worker node.
- Enhance the node object to capture topology information which can be referred to
It looks like you are adding a new resource, not changing the existing Node object. Not saying that you should. Just update this point to reflect the actual proposal.
I would also argue that this is not a goal in and of itself; it is a tool to achieve the goal, which is NUMA-aware scheduling.
To highlight that we are adding a new resource, I would explicitly add a paragraph at the end of the summary section to describe the changes/artifacts:
- A new scheduler plugin that makes topology-aware placement decisions
- A new resource object, NodeResourceTopology, to communicate NUMA status between kubelet and kube-scheduler
- kubelet changes to populate NodeResourceTopology
Capacity string `json:"capacity"`
}

type ZoneList []Zone
nit: I don't see the need for these aliases, just use the slices directly.
Sure, will remove them
metav1.ObjectMeta `json:"metadata,omitempty"`

TopologyPolicy []string `json:"topologyPolicies"`
Zones ZoneMap `json:"zones"`
ZoneMap is not defined
That was a mistake, will fix
Name string `json:"name"`
Type string `json:"type"`
Parent string `json:"parent,omitempty"`
Costs CostList `json:"costs,omitempty"`
Is this what is currently used?
Yes, we do. You can find the current API we use here: https://github.com/k8stopologyawareschedwg/noderesourcetopology-api/blob/master/pkg/apis/topology/v1alpha1/types.go
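For readers of this thread, a consolidated sketch of those v1alpha1 types as they appear in the linked repository; the field sets may have drifted since, so treat the repository as authoritative.

```go
// Sketch of the NodeResourceTopology API published in
// k8stopologyawareschedwg/noderesourcetopology-api (v1alpha1).
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// NodeResourceTopology describes the NUMA resources of one worker node;
// the object name matches the node name.
type NodeResourceTopology struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	TopologyPolicies []string `json:"topologyPolicies"`
	Zones            ZoneList `json:"zones"`
}

// Zone is one topology unit, typically a NUMA cell.
type Zone struct {
	Name       string           `json:"name"`
	Type       string           `json:"type"`
	Parent     string           `json:"parent,omitempty"`
	Costs      CostList         `json:"costs,omitempty"`
	Attributes AttributeList    `json:"attributes,omitempty"`
	Resources  ResourceInfoList `json:"resources,omitempty"`
}

// ResourceInfo reports per-zone capacity and allocatable for one resource.
type ResourceInfo struct {
	Name        string `json:"name"`
	Allocatable string `json:"allocatable"`
	Capacity    string `json:"capacity"`
}

// CostInfo is the relative distance from one zone to another (cf. NUMA distances).
type CostInfo struct {
	Name  string `json:"name"`
	Value int    `json:"value"`
}

// AttributeInfo is a free-form key/value property of a zone.
type AttributeInfo struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

type (
	ZoneList         []Zone
	ResourceInfoList []ResourceInfo
	CostList         []CostInfo
	AttributeList    []AttributeInfo
)
```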
`serviceAccountName: noderesourcetopology-account` would have to be added to the scheduler deployment manifest.

# Use cases
Please follow the template: this should be within the Proposal section.
Once the information is captured in the NodeResourceTopology API, the scheduler can refer to it like it refers to Node Capacity and Allocatable while making a topology-aware scheduling decision.
Add `# Design details`.
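To illustrate how a scheduler plugin might "refer to it like Node Capacity and Allocatable", here is a minimal sketch of a Filter extension point. Only the Filter signature comes from the kube-scheduler framework; the lister, the policy names and the fit check are placeholders for whatever the real plugin implements.

```go
// Minimal sketch of a topology-aware Filter plugin.
package noderesourcetopology

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// zone is a simplified per-NUMA-cell view derived from NodeResourceTopology.
type zone struct {
	Name        string
	Allocatable v1.ResourceList
}

// topologyLister abstracts however the plugin would read per-node topology
// objects (for example, an informer on the new resource). It is hypothetical.
type topologyLister interface {
	Get(nodeName string) (policies []string, zones []zone, err error)
}

// Match is the sketched plugin.
type Match struct{ lister topologyLister }

func (m *Match) Name() string { return "NodeResourceTopologyMatch" }

// Filter marks a node Unschedulable when one of the pod's containers cannot be
// served from a single NUMA cell and the node reports a single-numa-node policy.
func (m *Match) Filter(ctx context.Context, _ *framework.CycleState,
	pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	policies, zones, err := m.lister.Get(nodeInfo.Node().Name)
	if err != nil || len(zones) == 0 || !singleNUMANode(policies) {
		// No topology data or no strict policy: defer to the other plugins.
		return nil
	}
	for _, c := range pod.Spec.Containers {
		if !fitsOneZone(c.Resources.Requests, zones) {
			return framework.NewStatus(framework.Unschedulable,
				"cannot align container resources to a single NUMA cell")
		}
	}
	return nil
}

func singleNUMANode(policies []string) bool {
	for _, p := range policies {
		if p == "SingleNUMANodeContainerLevel" || p == "SingleNUMANodePodLevel" {
			return true
		}
	}
	return false
}

// fitsOneZone reports whether some single zone can hold every requested quantity.
func fitsOneZone(requests v1.ResourceList, zones []zone) bool {
	for _, z := range zones {
		fits := true
		for name, want := range requests {
			have, ok := z.Allocatable[name]
			if !ok || have.Cmp(want) < 0 {
				fits = false
				break
			}
		}
		if fits {
			return true
		}
	}
	return false
}
```

The policy strings follow the names discussed later in this thread; the real plugin may encode the policy differently.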
A number of Kubernetes worker nodes run on bare metal with NUMA topology, with the TopologyManager feature gate enabled on the nodes. In this configuration, the operator does not want pods to be re-scheduled when the host topology turns out to be unsatisfactory; the operator wants scheduling to succeed the first time.

# Known limitations
Also part of the Proposal, as either Notes/Constraints/Caveats or Risks and Mitigations.
# Known limitations

Kube-scheduler makes an assumption about current resource usage on the worker node, since it knows which pods are assigned to which nodes; this assumption is made right after kube-scheduler chooses a node. But in the case of scheduling with NUMA topology, only the Topology Manager on the worker node knows the exact NUMA cells used by a pod, and this information reaches kube-scheduler with some latency. Kube-scheduler will therefore not know the actual NUMA topology until the topology exporter sends it back. This could be mitigated if, in the proposed plugin, kube-scheduler added a hint about which NUMA id the pod could be assigned to, which the Topology Manager on the worker node could then take into account.
This paragraph needs some work.
Maybe you can put it as a numbered list to describe the sequence of events.
The main point is that we could have a similar situation of pods being scheduled to nodes where they don't fit, due to latency of the topology manager (part of the kubelet) reporting.
Is the mitigation something that you plan to do in alpha or beta?
Yeah, this does need to be reworded. Will do that.
We can certainly work on enabling some sort of communication between the Topology aware scheduler and the Kubelet to take the Numa node identified by the scheduler into consideration while aligning resources in alpha or beta.
* Alpha (v1.23)

The following changes are required:
- [ ] Introducing Topology information as part of the Node API
Also: population by kubelet.
Please use the canonical template: https://github.com/kubernetes/enhancements/tree/master/keps/NNNN-kep-template
In order to address this issue, the scheduler needs to choose a node considering resource availability along with the underlying resource topology and the Topology Manager policy on the worker node.

This document describes behaviour of the Kubernetes Scheduler which takes worker node topology into account.
Suggested change:
- This document describes behaviour of the Kubernetes Scheduler which takes worker node topology into account.
+ This enhancement proposes changes to make kube-scheduler aware of node NUMA topology when making scheduling decisions.
- Make scheduling process more precise when we have NUMA topology on the worker node.
- Enhance the node object to capture topology information which can be referred to
I would also argue that this is not a goal in and of itself; it is a tool to achieve the goal, which is NUMA-aware scheduling.
## Non-Goals

- Change the PodSpec to allow requesting a specific node topology manager policy
- This Proposal requires exposing NUMA topology information. This KEP doesn't
isn't that information already known by topology manager, which is part of kubelet?
- Make scheduling process more precise when we have NUMA topology on the worker node.
- Enhance the node object to capture topology information which can be referred to
To highlight that we are adding a new resource, I would explicitly add a paragraph at the end of the summary section to describe the changes/artifacts:
- A new scheduler plugin that makes topology-aware placement decisions
- A new resource object, NodeResourceTopology, to communicate NUMA status between kubelet and kube-scheduler
- kubelet changes to populate NodeResourceTopology
plugin into the main tree as a built-in plugin. This plugin implements a simplified version of Topology Manager and is therefore different from the original Topology Manager algorithm. The plugin would be disabled by default and, when enabled, would check the ability to run a pod only for the single-numa-node policy on the node; since that is the strictest policy, a pod that passes the single-numa-node check for the worker node would also launch successfully under the other existing policies.

To work, this plugin requires topology information of the available resource on the worker nodes.
Suggested change:
- To work, this plugin requires topology information of the available resource on the worker nodes.
+ To work, this plugin requires topology information of the available resources for each NUMA cell on worker nodes.
The current policies of TopologyManager cannot coexist at the same time, but in the future such policies could appear. For example, we could have a policy for HyperThreading, and it could live alongside NUMA policies.

To use these policy names both in kube-scheduler and in kubelet, the string constants of these labels should be moved from pkg/kubelet/cm/topologymanager/ and pkg/kubelet/apis/config/types.go to a single place, pkg/apis/core/types.go.
this is not necessary to discuss in the KEP; let's focus on the enhancement itself, its semantics and API.
Okay, will remove this.
## Plugin implementation details

### Description of the Algorithm
this section should discuss the semantics of the feature, the dependencies, the flow of information and any potential race conditions.
A description of how things will work and status updates, starting from node/scheduler startup, scheduling a pod, scheduling another pod before the first one gets picked up by kubelet, deleting a pod, etc. What changes do all of these cause to the state in kubelet, api-server and scheduler, if any?
The algorithm which implements the single-numa-node policy is the following (Go listing elided in this excerpt):
yes, please avoid code, see also comment above.
# Alternative Solution
in this section, I would describe the alternative proposal, pros/cons compared to primary one and why the primary one is better. Also, this is not proposing an end-to-end alternative, only the kubelet part, if so please make that clear as well.
The proto details mentioned in the following section are not really helpful in evaluating that.
can you discuss the following potential end-to-end alternative: 1:1 worker pod to node assignment, so apart from kubelet and daemonsets, the pod will take the whole node and the application is responsible for forking processes and assign them to NUMA cells. How feasible is this, do we have frameworks/libraries that allows applications to do that (e.g., do MPI libraries enable this)?
> in this section, I would describe the alternative proposal, pros/cons compared to primary one and why the primary one is better. Also, this is not proposing an end-to-end alternative, only the kubelet part, if so please make that clear as well.

I will fix this and make this more coherent.

> can you discuss the following potential end-to-end alternative: 1:1 worker pod to node assignment, so apart from kubelet and daemonsets, the pod will take the whole node and the application is responsible for forking processes and assign them to NUMA cells. How feasible is this, do we have frameworks/libraries that allows applications to do that (e.g., do MPI libraries enable this)?

This is a very interesting point! Topology Manager can deal with the alignment of resources at container and pod scope. So if we have a pod with multiple containers, a process running inside the containers can be assigned to a NUMA cell, but I don't think we would be able to be more granular than that (assign processes inside a container to separate NUMA nodes) without making major changes to Kubelet.
The daemon, which runs outside of the kubelet, will collect all necessary information on running pods; based on the allocatable resources of the node and the resources consumed by pods, it will provide the available resources in a CRD, where one CRD instance represents one worker node. The name of the CRD instance is the name of the worker node.

## CRD API
Is it still a CRD even if we plan to host it in core?
This was part of the alternative solution section
Force-pushed from d78bc60 to ca07ed5.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: swatisehgal. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
This KEP consolidates the following two KEPs into one: kubernetes#1858, kubernetes#1870. Also the KEP talks about introducing NodeResourceTopology as a native Kubernetes resource.
Co-authored-by: Alexey Perevalov <alexey.perevalov@huawei.com>
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Force-pushed from 8bcbf30 to abdc1bf.
Force-pushed from abdc1bf to 25b4cef.
- "@huang-wei" | ||
- "@derekwaynecarr" | ||
- "@mrunalp" | ||
- "@rphilips" |
rphillips
Thanks, will fix now.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
1. If requested resource cannot be found on a node, it is unset as available NUMA cell
1. If an unknown resource has 0 quantity, the NUMA cell should be left set.
* The following checks are performed:
1. Add NUMA cell to the resourceBitmask if resource is cpu and it's not guaranteed QoS, since cpu will flow
Do constants within the API for 'cpu' and 'memory' make sense?
... perhaps others?
"cpu", "memory" and other resources are not stored as constants but rather stored in the ResourceInfo data structure below where the name corresponds the resource name stored here as value which is obtained from the v1.ResourceCPU or v1.ResourceMemory and in the scheduler plugin a resource is compared with values v1.ResourceCPU or v1.ResourceMemory to determine if the resource is a CPU or memory.
type ResourceInfo struct {
Name string `json:"name"`
Allocatable string `json:"allocatable"`
Capacity string `json:"capacity"`
}
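A tiny sketch of that comparison (the helper name is hypothetical):

```go
package example

import v1 "k8s.io/api/core/v1"

// resourceKind classifies a ResourceInfo.Name the way described above: by
// comparing it against the well-known v1.ResourceCPU / v1.ResourceMemory values.
func resourceKind(name string) string {
	switch v1.ResourceName(name) {
	case v1.ResourceCPU:
		return "cpu"
	case v1.ResourceMemory:
		return "memory"
	default:
		return "device-or-extended"
	}
}
```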
Force-pushed from 25b4cef to 586c982.
Name string `json:"name"`
Value string `json:"value"`
}
// Kubelet writes to NodeResourceTopology
@rphillips Here I have provided a real-world example of how these fields would look once populated. I thought it was best to do that after all the data structures have been defined. Would you prefer for it to be moved before the data structures are defined, or maybe to a different section?
- "@ehashman" | ||
- "@klueska" | ||
approvers: | ||
- "@sig-scheduling-leads" |
needs approver from sig-node as well.
#prr-approvers:

see-also:
- "https://github.com/kubernetes/enhancements/pull/1870"
These should be links to related merged KEPs. Example: /keps/sig-node/693-topology-manager
### Risks and Mitigations

Topology Manager on the worker node knows exact resources and their NUMA node allocated to pods but the and node resource
it looks like there is some typo here.
actual NUMA topology until the information of the available resources at a NUMA node level is evaluated in the kubelet which could still lead to scheduling of pods to nodes where they won't be admitted by Topology Manager.

This can be mitigated if kube-scheduler provides a hint of which NUMA ID a pod should be assigned and Topology Manager on the
Elaborate. How is the NUMA ID used in follow-up scheduling cycles?
Costs CostList `json:"costs,omitempty"`
Attributes AttributeList `json:"attributes,omitempty"`
Resources ResourceInfoList `json:"resources,omitempty"`
}
better also provide the comments
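One possible commented form of these fields, based on a reading of the linked v1alpha1 API rather than on text in this KEP:

```go
// Zone describes one topology unit of a node, typically a NUMA cell.
type Zone struct {
	Name   string `json:"name"`             // zone identifier, e.g. "node-0"
	Type   string `json:"type"`             // zone kind, e.g. "Node" for a NUMA cell
	Parent string `json:"parent,omitempty"` // enclosing zone, if zones are nested

	// Costs lists the relative distance from this zone to the other zones,
	// analogous to the NUMA distance table exposed by the kernel.
	Costs CostList `json:"costs,omitempty"`

	// Attributes carries free-form key/value properties of the zone.
	Attributes AttributeList `json:"attributes,omitempty"`

	// Resources reports per-zone capacity and allocatable quantities.
	Resources ResourceInfoList `json:"resources,omitempty"`
}
```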
1. At the filter extension point of the plugin, the QoS class of the pod is determined; in case it is a best effort pod or the Topology Manager Policy configured on the node is not single-numa-node, the node is not considered for scheduling
1. The Topology Manager Scope is determined.
Can you elaborate what this is?
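For context, the kubelet Topology Manager supports a `container` scope and a `pod` scope (the `topologyManagerScope` kubelet configuration field): the scope decides whether alignment is evaluated per container or for the pod as a whole. A minimal sketch of how a scope-dependent check might group the requests to align; the helper name and grouping are illustrative only.

```go
package example

import v1 "k8s.io/api/core/v1"

// requestsToAlign returns the resource sets that must each fit a single NUMA
// cell, depending on the Topology Manager scope. With the "pod" scope the
// whole pod is one unit; with the "container" scope (the default) each
// container is checked on its own.
func requestsToAlign(pod *v1.Pod, scope string) []v1.ResourceList {
	if scope == "pod" {
		total := v1.ResourceList{}
		for _, c := range pod.Spec.Containers {
			for name, q := range c.Resources.Requests {
				sum := total[name]
				sum.Add(q)
				total[name] = sum
			}
		}
		return []v1.ResourceList{total}
	}
	out := make([]v1.ResourceList, 0, len(pod.Spec.Containers))
	for _, c := range pod.Spec.Containers {
		out = append(out, c.Resources.Requests)
	}
	return out
}
```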
- [ ] Implementation of Score extension point
- [ ] Add node E2E tests.
- [ ] Provide beta-level documentation.
Documentation should be part of alpha too
## Design Details

- add a new flag in Kubelet called `ExposeNodeResourceTopology` in the kubelet config or command line argument called `expose-noderesourcetopology` which allows
Take a decision here: are you proposing a permanent command line flag or not? Note that this is different from the feature gate, which should disappear in a few releases.
CRI or CNI may require updating that component before the kubelet.
-->

Feature flag will apply to kubelet only, so version skew strategy is N/A.
what happens in the kubelet when you disable the flag? Does the NodeResourceTopology object get deleted? If I downgrade a kubelet, note that the object might stay there.
- [X] Enable scheduler plugin `NodeResourceTopologyMatch` in the KubeScheduler config
- Describe the mechanism:
  - Will enabling / disabling the feature require downtime of the control plane? Yes, Feature gate must be set on kubelet start. To disable, kubelet must be
kube-scheduler
// Kubelet writes to NodeResourceTopology
// and scheduler plugin reads from it
// Real world example of how these fields are populated is as follows:
// Cells:
Cells is a map. Can you add the key of the Cells map in this example?
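For illustration only, a hedged example of how a populated object might look for a node with two NUMA cells, continuing the types sketched earlier in this thread. All names and quantities are hypothetical; if Cells/Zones is modelled as a map, zone names such as "node-0" would double as the map keys.

```go
// Hypothetical populated NodeResourceTopology for a two-cell node.
var example = NodeResourceTopology{
	ObjectMeta:       metav1.ObjectMeta{Name: "worker-1"}, // matches the node name
	TopologyPolicies: []string{"SingleNUMANodeContainerLevel"},
	Zones: ZoneList{
		{
			Name: "node-0",
			Type: "Node",
			Resources: ResourceInfoList{
				{Name: "cpu", Allocatable: "20", Capacity: "24"},
				{Name: "memory", Allocatable: "60Gi", Capacity: "64Gi"},
				{Name: "example.com/device", Allocatable: "1", Capacity: "1"},
			},
			Costs: CostList{{Name: "node-0", Value: 10}, {Name: "node-1", Value: 20}},
		},
		{
			Name: "node-1",
			Type: "Node",
			Resources: ResourceInfoList{
				{Name: "cpu", Allocatable: "22", Capacity: "24"},
				{Name: "memory", Allocatable: "62Gi", Capacity: "64Gi"},
			},
			Costs: CostList{{Name: "node-0", Value: 20}, {Name: "node-1", Value: 10}},
		},
	},
}
```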
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`

TopologyPolicies []string `json:"topologyPolicies"`
Can you list the policies and explain them? For example: SingleNUMANodeContainerLevel, SingleNUMANodePodLevel?
Costs CostList `json:"costs,omitempty"`
Attributes AttributeList `json:"attributes,omitempty"`
Resources ResourceInfoList `json:"resources,omitempty"`
}
+1 for comments.
Just by reading the examples, I still don't understand what costs is.
1. Add NUMA cell to the resourceBitmask if resource is memory and it's not guaranteed QoS, since memory will flow
1. Add NUMA cell to the resourceBitmask if zero quantity for non existing resource was requested
1. otherwise check amount of resources
* Once the resourceBitMark is determined it is ANDed with the cummulative bitmask
cumulative ?
1. otherwise check amount of resources
* Once the resourceBitMark is determined it is ANDed with the cummulative bitmask
4. If resources cannot be aligned from the same NUMA cell for a container, alignment cannot be achieved for the entire pod and the resource cannot be aligned in case of the pod under consideration. Such a pod is returned with a Status Unschedulable
If we return Unschedulable, it will enter the PostFilter phase. But we don't know which cell the pod is assigned to, so DeletePod in PostFilter will not change the NodeResourceTopology, and PodPassesFiltersOnNode will always return false. Then this pod can never choose one or more pods to preempt.
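To make the bitmask step quoted above concrete, a minimal sketch using a plain integer bitmask (one bit per NUMA cell); the per-resource selection rules are simplified relative to the list in the diff.

```go
package example

// fitsSingleCell reports whether at least one NUMA cell can satisfy every
// per-resource requirement. Each entry of perResourceCells is the bitmask of
// cells able to provide one requested resource (bit i set == cell i can).
// The cumulative mask is the AND of all per-resource masks; an empty result
// means the pod cannot be aligned and would be returned as Unschedulable.
func fitsSingleCell(perResourceCells []uint64) bool {
	cumulative := ^uint64(0) // start with every cell allowed
	for _, mask := range perResourceCells {
		cumulative &= mask
	}
	return cumulative != 0
}
```

For example, if CPU is available on cells 0 and 1 (0b11) but a requested device exists only on cell 1 (0b10), the AND is 0b10 and the pod can still be aligned on cell 1.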
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
/lifecycle stale
/lifecycle rotten
/remove-lifecycle rotten
/lifecycle stale
/remove-lifecycle stale
Is this topic still in progress? We hope to help the community continue to promote this work.
The first version of
/lifecycle stale
/lifecycle rotten
/close
@k8s-triage-robot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This KEP consolidates the following two KEPs into one:
- kubernetes#1858
- kubernetes#1870

Also the KEP talks about introducing NodeResourceTopology as a native Kubernetes resource.

Fixes: kubernetes/kubernetes#84869