Simplified version of topology manager in kube-scheduler #1858
Welcome @AlexeyPerevalov!
Hi @AlexeyPerevalov. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
- "@AlexeyPerevalov" | ||
owning-sig: sig-scheduling | ||
participating-sigs: | ||
reviewers: |
Please add me as reviewer and approver; we also need a reviewer from a sig-node lead.
## Non-Goals

- Do not change Topology Manager behaviour to be able to work with policy in
I guess you meant to say that changing topology manager behavior is a non-goal, and so the phrasing should be like this:
```diff
- - Do not change Topology Manager behaviour to be able to work with policy in
+ - Change Topology Manager behaviour to be able to work with policy in
```
would this be more descriptive of this non-goal: "Change the PodSpec to allow requesting a specific node topology manager policy"?
yes, it would be. Thank you.
```go
}
if bitmask.IsEmpty() {
	// we can't align container, so we can't align a pod
	return framework.NewStatus(framework.Error, fmt.Sprintf("Can't align container: %s", container.Name))
```
```diff
-	return framework.NewStatus(framework.Error, fmt.Sprintf("Can't align container: %s", container.Name))
+	return framework.NewStatus(framework.Unschedulable, fmt.Sprintf("Can't align container: %s", container.Name))
```
```go
for resource, quantity := range container.Resources.Requests {
	resourceBitmask := bm.NewEmptyBitMask()
	if guarantedQoS(&container.Resources.Limits, resource, quantity) {
		for numaIndex, numaNodeResources := range numaMap {
```
can we change this to match what you are proposing to add to NodeInfo (lines 100-105): a list of NUMANodeResources, and the index is NUMAID
Yes, sure, that would be better. It was an intermediate map; I tried several approaches, like prefixes with numa%d in the ResourceName. Now it's not necessary, and it may be confusing here in the proposal.
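For clarity, a minimal sketch of the shape being discussed here, assuming hypothetical type and field names rather than the KEP's final API:

```go
package numasketch // hypothetical package, for illustration only

import v1 "k8s.io/api/core/v1"

// NUMANodeResource describes the resources available on one NUMA node; the
// slice index (or the NUMAID field) identifies the NUMA node, so the
// intermediate numaMap and numa%d-prefixed resource names are not needed.
type NUMANodeResource struct {
	NUMAID    int
	Resources v1.ResourceList // e.g. cpu, memory, device-plugin resources
}

// NodeInfoNUMA shows only the hypothetical NUMA-related field that would be
// added to the scheduler's NodeInfo; the filter loop would then range over
// this slice, using the index as the NUMA ID.
type NodeInfoNUMA struct {
	NUMANodeResources []NUMANodeResource
}
```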
Available resources with topology of the node should be stored in CRD. Format of the topology described
[in this document](https://docs.google.com/document/d/12kj3fK8boNuPNqob6F_pPU9ZTaNEnPGaXEooW1Cilwg/edit).
who is going to maintain this CRD? sig-node?
The CRD was suggested as one of the approaches during a discussion in a sig-node meeting, and interested people (from Red Hat) were involved in the discussion of the CRD format. Since we plan to create/update the CRD not directly from kubelet but from a separate daemon (implemented as a DaemonSet), I think its authors will maintain the CRD too, including me.
Evolving the API spec from a CRD is a good starting point; at some point, when the spec is mature, we can merge it back into a core API. RuntimeClass is a good example - it was incubated as a CRD in the alpha phase and then merged back into upstream in the beta phase.
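To make the CRD direction concrete, here is a minimal sketch of what such an object could look like. All package, type, and field names are assumptions for illustration; the actual format is the one being defined in the linked document:

```go
package v1alpha1 // hypothetical API group/version for the node topology CRD

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// NodeResourceTopology sketches a per-node object published by a node agent
// (e.g. a DaemonSet) rather than by kubelet itself.
type NodeResourceTopology struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// TopologyPolicy mirrors the kubelet topology manager policy on the node,
	// e.g. "none", "best-effort", "restricted" or "single-numa-node".
	TopologyPolicy string `json:"topologyPolicy"`

	// NUMANodes lists per-NUMA-node available resources; the NUMAID field
	// identifies the NUMA node.
	NUMANodes []NUMANodeResource `json:"numaNodes"`
}

// NUMANodeResource describes the resources available on one NUMA node.
type NUMANodeResource struct {
	NUMAID    int             `json:"numaID"`
	Resources v1.ResourceList `json:"resources"`
}
```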
Node label contains the name of the topology policy currently implemented in kubelet.

Proposed Node Label may look like this:
`beta.kubernetes.io/topology=none|best-effort|restricted|single-numa-node`
This will be confusing, we already have topology.kubernetes.io: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/api/core/v1/well_known_labels.go#L22:1
We should run this by sig API Machinery to define a label.
also, why do we need this label at all? can't we have the policy in the CRD?
The CRD describes the node, so yes, it's a good idea to keep it there.
+1 to make this info described by CRD.
- This Proposal requires exposing NUMA topology information. This KEP doesn't
describe how to expose all necessary information it just declare what kind of
information is necessary.
but this is a blocker to having the scheduler work done, so I think we need both KEPs approved at the same time.
We tried several approaches to export topology from the worker node; some of them required kubelet modification. We have now come to the conclusion that it's better to avoid kubelet modification. There are two feasible approaches: collecting resources at the CRI level, and using the kubelet podresources interface (a unix domain socket, but right now it doesn't provide cpumanager information, only resources of the device plugins).
Here I agree, we need a detailed description of how it would be implemented. Probably that KEP should be in sig-node, but as I mentioned before, the implementation should not touch kubelet.
The algorithm which implements single-numa-node policy is following:

```go
for _, container := range containers {
```
I guess this code is also executed by kubelet/topologyManager. Note that the scheduler can't take a dependency on kubelet, so I suggest this logic be extracted into a pkg in staging that both kubelet and the scheduler import.
The original logic of TopologyManager has high runtime complexity; that's why a simplified version is proposed here. Maybe the best way would be to move the whole TopologyManager to staging, but reusing it as-is is impossible.
When I started this task, I considered moving the whole logic of TopologyManager into kube-scheduler, but it requires:
- factoring out TopologyManager completely, as well as the dependent managers (CPUManager/DeviceManager)
- moving CPUManager/DeviceManager too, which requires significant changes to kubelet's API and is probably impossible now.
of resources in that topology became actual. Pod could be scheduled on the node
where total amount of resources are enough, but resource distribution could not
satisfy the appropriate Topology policy. In this case the pod failed to start. Much
better behaviour for scheduler would be to select appropriate node where admit
I suppose "admit handlers" is kubelet terms? If so, let's say:
better behaviour for scheduler would be to select appropriate node where admit | |
better behaviour for scheduler would be to select appropriate node where kubelet admit |
## Goals

- Make scheduling process more precise when we have NUMA topology on the
Remove extra spaces. (applies elsewhere)

```diff
- - Make scheduling process more precise when we have NUMA topology on the
+ - Make scheduling process more precise when we have NUMA topology on the
```
Plugin checks the ability to run pod only in case of single-numa-node policy on the
node, since it is the most strict policy, it implies that the launch on the node with
other existing policies will be successful if the condition for single-numa-node policy passed for the worker node.
Proposed plugin will use node label to identify which topology policy is
My understanding is that we prefer to use the CRD to describe whether the node has single-numa-node enabled, as well as the numaMap info, right? If so, let's update the wording here.
right, will be updated.
```go
if resourceBitmask.IsEmpty() {
	continue
}
bitmask.And(resourceBitmask)
```
Should we `break` below to return early? (Unless we prefer a full bitmask to log a verbose scheduling failure.)
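For illustration, here is a minimal, self-contained sketch of the early-return variant being discussed. It uses a plain uint64 as the NUMA bitmask instead of the topology manager's bitmask package, and the function and variable names are made up for this example:

```go
package main

import "fmt"

// alignToSingleNUMANode intersects per-resource NUMA masks and returns early
// as soon as the intersection becomes empty, i.e. no single NUMA node can
// satisfy every resource request seen so far.
func alignToSingleNUMANode(perResourceMasks map[string]uint64) (uint64, error) {
	mask := ^uint64(0) // start with "every NUMA node is still possible"
	for resource, resourceMask := range perResourceMasks {
		mask &= resourceMask
		if mask == 0 {
			// Early return: scanning further resources cannot help.
			return 0, fmt.Errorf("no single NUMA node can satisfy %q together with the previous resources", resource)
		}
	}
	return mask, nil
}

func main() {
	masks := map[string]uint64{
		"cpu":             0b11, // NUMA 0 and 1 both fit the CPU request
		"example.com/nic": 0b10, // only NUMA 1 has the requested device
	}
	fmt.Println(alignToSingleNUMANode(masks))
}
```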
information is necessary.

# Proposal
Kube-scheduler builtin plugin will be added to the main tree. This plugin
Add a blank line above.
IMO there are 2 parts on the scheduler side:
- In-tree changes on internal data structures to accommodate the NUMAMap info, which will then be exposed by the scheduler framework handle (SnapshotSharedLister).
- A new scheduler plugin to honor the NUMAMap info so that the scheduling decision is aligned with kubelet - we can discuss later whether we want to put it in-tree or out-of-tree (a minimal skeleton of such a plugin is sketched below).
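A minimal skeleton of what such a filter plugin could look like, purely as a sketch: the plugin name, package name, and the exact framework import path (which differs between releases) are assumptions, not the KEP's final implementation.

```go
package nodetopologymatch // hypothetical package name

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework" // path varies by release (e.g. .../framework/v1alpha1)
)

// Name is a hypothetical plugin name.
const Name = "NodeResourceTopologyMatch"

// TopologyMatch would filter out nodes whose NUMA topology cannot satisfy
// the pod under the node's single-numa-node policy.
type TopologyMatch struct{}

var _ framework.FilterPlugin = &TopologyMatch{}

func (tm *TopologyMatch) Name() string { return Name }

func (tm *TopologyMatch) Filter(ctx context.Context, state *framework.CycleState,
	pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	// Sketch only: the real logic would read the node's topology policy and
	// per-NUMA resources (e.g. from the CRD-backed snapshot discussed above),
	// run the alignment check, and return framework.Unschedulable when no
	// single NUMA node can satisfy the pod's guaranteed requests.
	return nil
}
```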
@@ -0,0 +1,177 @@
---
title: Deducted version of TopologyManager in kube-scheduler
Is "Tailored" more proper? (Not an English expert though..)
node-topology aware scheduling
The meaning here was that we implement a reduced/simplified version of the topology manager.
Yes, maybe "Node-topology aware scheduling" is better - it's both specific and abstract.
@vpickard could you comment on this?
Sorry, but I'm opposing this approach. Here are the reasons why:
It is really not a good idea to start introducing hardware architecture specifics and assumptions into the scheduler.
What could be a better alternative, in your opinion, to avoid the fundamental issue this KEP is trying to address, which is pods ending up in Topology Affinity Error, because the scheduler picks a node and then the Topology Manager in that node fails to properly align the requested resources?
It really depends on what we actually want to solve. In my opinion, there is a need at the scheduler level to solve the fundamental error condition that can happen to any Pod, not only due to Topology Manager:
The second problem, "aligning resources": this problem is complex, and the right alignment can't just be assumed from the most common pattern of building servers, or from what the optimal "alignment" would actually be for a particular workload. Some workloads will require a combination of CPUs / memory nodes and particular PCI devices aligned. For some, memory might not be critical, as the workload is not memory intensive and only CPU vs. PCI device matters. For some, it might require memory from more than one region, etc. Exposing all of those conditions at the scheduler level is not a simple task, and generally not needed for many k8s setups where nodes do not expose real hardware topology. If we want to go down that path, there are some fundamental changes that need to be made in kubelet on how to properly expose and count resources (and collect information about their availability). There are some blueprints of that kind in the virtual kubelet project; it might require changes in the CRI APIs and on the runtimes side... So, I'd really suggest focusing on the first fundamental issue mentioned above: "A Pod is scheduled on a node, and some (potentially) unrecoverable error occurs for this particular pod during creation/starting. How do we gracefully re-schedule it to another candidate node, or properly return an error to the user if we can't run this pod anywhere else in the cluster?"
Thanks @kad for the deep and insightful response. A lot to unpack here. Let me try to expand and clarify some points in order to (hopefully) make further conversation easier.
Absolutely
This seems to suggest that we should not treat Topology Affinity Errors in a special way: they are yet another instance of node-related pod admission failures, and no extra care is needed in the scheduler. The approach described above solves the fundamental issue some of us (myself included!) have with the current behaviour of k8s, in which a pod rejected by Topology Manager ends up in error/TopologyAffinityError. This is suboptimal and should be improved. The drawback I can see in the approach you described above is that the pod can spend quite some time waiting to be scheduled, because
So the issue is that there is no guarantee whatsoever regarding how long it takes for a pod to actually get running, even if we know ahead of time that the cluster can actually run the pod. The pod can just be subject to an unlucky streak of random picks.
Makes sense to me. I can see why this problem is complex and hard. So as a general direction it seems better to keep the knowledge of the HW details/topology on the node, and not propagate these details into the cluster, right? In other words, from a resource assignment perspective, if we keep this check in the kubelet, we will never know if a node can actually run a given pod (/container) unless we try to schedule on that specific node and let Topology Manager (or any other node component which knows about all these details) do this check. This is not necessarily a problem, but if we all as a community agree that this is the right direction, then this becomes a pretty strong constraint we should all be well aware of, so I'm taking the chance to write it down explicitly :)
This is an approach we talked about (internally) a couple of months ago. Besides the unpredictable scheduling delay I described above, there is nothing really wrong here from my perspective, and it could be a nice first step.
Does this look right?
In general, what we have discussed here falls into 2 directions:
A 3rd option comes to my mind which combines the above 2. Here is a very rough idea: the scheduler doesn't add topology-related computation; instead, it suggests more than 1 node (if there are any) for pod placement by modifying the node's
BTW: to avoid compute overhead on every potential node, the suggested number should be configurable, such as defaulting to 5.
I have a question regarding the 3rd option:
There was some initial talk about this approach. The initial idea was something along these lines. The biggest problem was the existing controllers, like in this scenario:
Looks neat! However, how should the competition be regulated? IOW, how does this model interact with configured replicas? Let's say I run a deployment which wants exactly one of my pods; I think it may very well happen in this scenario that, say, three nodes try to run my pod, only one wins (obviously) and keeps the pod going, and the losers notice and silently kill their pods. Is my understanding right?
Re @AlexeyPerevalov:
Yes, as well as other existing Filtering constraints.
Nope, there is still no interaction between kube-scheduler and kubelet. Here is how it works: today only one kubelet "claims" the pod (with .spec.nodeName set by the scheduler), so it's likely that the admit handler may fail it. However, if multiple kubelets "pre-claim" the pod, hopefully at least one can admit it, or several kubelets compete in "claiming" the pod. Re @fromanirh, I understand both options 2 & 3 are very rough ideas, but it would be good to brainstorm here.
Not really. There is always one pod for each replica. It's just that the logic changes from "one kubelet admits and runs the pod" to "N kubelets admit and compete to run the pod". So there is no concept of "kill their pod", as there is only one pod. Only the one kubelet winner will be able to change the pod's ".spec.nodeName" to its node name, that's it.
The problem with providing multiple node nominations is that it still doesn't guarantee that a node will admit the pod; moreover, we get into a tuning issue where we need to select how many nodes to nominate, which will likely differ by cluster size, current cluster utilization, etc.
Force-pushed from 8c263bd to 1c2dce0.
It looks like a new stage of scheduling. kube-scheduler says these nodes are nominated, and the nodes run their admit handlers; if it passes, the kubelet on the node changes e.g. the pod's status to ReadyToRunOnTheNode or AdmitHandlerPassed, and kube-scheduler tracks it and changes the pod's spec.nodeName to the appropriate node. But such behavior is necessary only for nodes with TopologyManager enabled, and only for such nodes will scheduling be a little bit longer.
I think the maximum number of nominated nodes could be defined in configuration or, yes, evaluated dynamically.
As part of the enablement of Topology-aware scheduling in kubernetes, the scheduler plugin has been proposed as in-tree. KEP: kubernetes/enhancements#1858 PR: kubernetes/kubernetes#90708 To enable faster development velocity, testing and community adoption we are also packaging as an out of tree scheduler plugin. This out-of-tree implementation is based on the above PR and KEP.
We decided to focus on the out-of-tree plugin kubernetes-sigs/scheduler-plugins#119. This KEP will be postponed until noderesourcetopology-api is in staging or built in.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
/close
@AlexeyPerevalov: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
- This KEP consolidates the following two KEPs into one - kubernetes#1858 - kubernetes#1870 - Also the KEP talks about introducing NodeResourceTopology as a native Kubernetes resource. Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
- This KEP consolidates the following two KEPs into one - kubernetes#1858 - kubernetes#1870 - Also the KEP talks about introducing NodeResourceTopology as a native Kubernetes resource. Co-authored-by: Alexey Perevalov <alexey.perevalov@huawei.com> Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Ideas of it were discussed in following documents:
https://docs.google.com/document/d/1gPknVIOiu-c_fpLm53-jUAm-AGQQ8XC0hQYE7hq0L4c/edit?ts=5ea71ac0#
https://docs.google.com/presentation/d/1u6fD_8xf40dMTbpwBQmanOpI-WqHioryovHozUDfKhU/edit#slide=id.g7fc7df652a_29_2
https://docs.google.com/presentation/d/1y2ObdZUtMFp2Z0EeyUu1cm469EQNMBjJbhjKiHwSeQo/edit#slide=id.p
It relies on #1870
Fixes: kubernetes/kubernetes#84869
Issue: kubernetes/kubernetes#84869
Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>