
Simplified version of topology manager in kube-scheduler #1858

Conversation

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 12, 2020
@k8s-ci-robot
Contributor

Welcome @AlexeyPerevalov!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 12, 2020
@k8s-ci-robot
Contributor

Hi @AlexeyPerevalov. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Jun 12, 2020
@k8s-ci-robot k8s-ci-robot requested a review from ahg-g June 12, 2020 08:33
@k8s-ci-robot k8s-ci-robot added the sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. label Jun 12, 2020
@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Jun 12, 2020
- "@AlexeyPerevalov"
owning-sig: sig-scheduling
participating-sigs:
reviewers:
Member

please add me as reviewer and approver; we also need a reviewer from a sig-node lead.


## Non-Goals

- Do not change Topology Manager behaviour to be able to work with policy in
Member

I guess you meant to say that changing Topology Manager behavior is a non-goal, so the phrasing should be like this:

Suggested change
- Do not change Topology Manager behaviour to be able to work with policy in
- Change Topology Manager behaviour to be able to work with policy in

Member

would this be more descriptive of this non-goal: "Change the PodSpec to allow requesting a specific node topology manager policy"?

Contributor Author

yes, it would be. Thank you.

}
if bitmask.IsEmpty() {
// we can't align container, so we can't align a pod
return framework.NewStatus(framework.Error, fmt.Sprintf("Can't align container: %s", container.Name))
Member

Suggested change
return framework.NewStatus(framework.Error, fmt.Sprintf("Can't align container: %s", container.Name))
return framework.NewStatus(framework.Unschedulable, fmt.Sprintf("Can't align container: %s", container.Name))

for resource, quantity := range container.Resources.Requests {
resourceBitmask := bm.NewEmptyBitMask()
if guarantedQoS(&container.Resources.Limits, resource, quantity) {
for numaIndex, numaNodeResources := range numaMap {
Member

can we change this to match what you are proposing to add to NodeInfo (lines 100-105): a list of NUMANodeResources, and the index is NUMAID

Contributor Author

yes, sure, it would be better. It was intermediate map, since I tried several approaches, like prefixes with numa%d in ResourceName. Now it's not necessary, especially it may confuse here in proposal.

Comment on lines 86 to 87
Available resources with topology of the node should be stored in CRD. Format of the topology described
[in this document](https://docs.google.com/document/d/12kj3fK8boNuPNqob6F_pPU9ZTaNEnPGaXEooW1Cilwg/edit).
Member

who is going to maintain this CRD? sig-node?

Contributor Author

The CRD was suggested as one of the approaches during discussion in a sig-node meeting, and interested people (from Red Hat) were involved in the discussion of the CRD format. Since we plan to create/update the CRD not directly from kubelet but from a separate daemon (implemented as a DaemonSet), I think its authors, including me, will maintain the CRD too.

Member

Evolving the API spec from a CRD is a good starting point, at some time when the spec is mature, we can merge it back to a core API. RuntimeClass is a good example - it incubated as a CRD in alpha phase, and then merged back into upstream in the beta phase.
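
As a purely illustrative sketch (the actual format is the one in the linked document, and every name below is an assumption), such a CRD could expose the kubelet topology policy together with per-NUMA-zone allocatable resources, which would also make the node label discussed further down unnecessary:

```go
package v1alpha1 // hypothetical API group/version for the sketch

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// NodeResourceTopology is a per-node object, published by the node agent
// (DaemonSet) mentioned above, that a scheduler plugin could read.
type NodeResourceTopology struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// TopologyPolicy mirrors the kubelet topology manager policy,
	// e.g. "single-numa-node", so a separate node label is not needed.
	TopologyPolicy string `json:"topologyPolicy"`

	// Zones lists the NUMA nodes and the resources still allocatable on each.
	Zones []NUMAZone `json:"zones"`
}

// NUMAZone describes one NUMA node of the worker node.
type NUMAZone struct {
	NUMAID      int             `json:"numaID"`
	Allocatable v1.ResourceList `json:"allocatable"`
}
```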

Node label contains the name of the topology policy currently implemented in kubelet.

Proposed Node Label may look like this:
`beta.kubernetes.io/topology=none|best-effort|restricted|single-numa-node`
Member

this will be confusing, we already have topology.kubernetes.io: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/api/core/v1/well_known_labels.go#L22:1

We should run this by sig API Machinery to define a label.

Member

also, why do we need this label at all? can't we have the policy in the CRD?

Contributor Author

The CRD describes the node, so yes, it's a good idea to keep it there.

Member

+1 to make this info described by CRD.

Comment on lines 57 to 59
- This Proposal requires exposing NUMA topology information. This KEP doesn't
describe how to expose all necessary information it just declare what kind of
information is necessary.
Member

but this is a blocker to having the scheduler work done, so I think we need both KEPs approved at the same time.

Contributor Author

We tried several approaches to export topology from the worker node, some of which required kubelet modification, and we came to the conclusion that it's better to avoid modifying kubelet. There are two feasible approaches: collecting resources at the CRI level, and using the kubelet podresources interface (a unix domain socket, though currently it doesn't provide cpumanager information, only device plugin resources).
Here I agree, we need a detailed description of how it would be implemented. Probably that KEP should be in sig-node, but as I mentioned before, the implementation should not touch kubelet.
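
For reference, a minimal sketch of how such a daemon might read the kubelet podresources endpoint is below. The socket path and the v1alpha1 import path are assumptions for the sketch, and, as noted above, at the time this endpoint only reported device plugin resources, not cpumanager assignments:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1alpha1"
)

func main() {
	socket := "/var/lib/kubelet/pod-resources/kubelet.sock" // assumed default path
	conn, err := grpc.Dial(socket, grpc.WithInsecure(),
		grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
			return (&net.Dialer{}).DialContext(ctx, "unix", addr)
		}))
	if err != nil {
		log.Fatalf("cannot connect to podresources socket: %v", err)
	}
	defer conn.Close()

	client := podresourcesapi.NewPodResourcesListerClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	resp, err := client.List(ctx, &podresourcesapi.ListPodResourcesRequest{})
	if err != nil {
		log.Fatalf("podresources List failed: %v", err)
	}
	// Device plugin resources are reported per container with their device IDs;
	// a node agent could map device IDs to NUMA nodes and publish that via the CRD.
	for _, pod := range resp.GetPodResources() {
		for _, c := range pod.GetContainers() {
			for _, dev := range c.GetDevices() {
				fmt.Printf("%s/%s %s: %s -> %v\n",
					pod.GetNamespace(), pod.GetName(), c.GetName(),
					dev.GetResourceName(), dev.GetDeviceIds())
			}
		}
	}
}
```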

The algorithm which implements single-numa-node policy is following:

```go
for _, container := range containers {
Member

I guess this code is also executed by kubelet/topologyManager, note that the scheduler can't take a dependency on kubelet, and so I suggest this logic be extracted into a pkg in staging that both kubelet and the scheduler import.

Contributor Author

The original logic of the TopologyManager has high runtime complexity; that's why a simplified version is proposed here (a rough reconstruction of the loop is sketched below). Maybe the best way would be to move the whole TopologyManager to staging, but reusing it as-is is impossible.
When I started this task, I thought about moving the whole logic of the TopologyManager into kube-scheduler, but that requires:

  1. factoring out the TopologyManager entirely, as well as the dependent managers (CPUManager/DeviceManager)
  2. moving the CPUManager/DeviceManager too, which requires substantial changes in kubelet's API and is probably impossible right now.
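
Putting the snippets quoted throughout this review back together, the simplified single-numa-node check presumably looks roughly like the function below. This is a reconstruction, not the KEP's exact text: guarantedQoS, the bm bitmask calls, and the framework/fmt aliases are taken from the quoted snippets, while numaMap's element type and the NewFullBitMask/HasEnough helpers are assumed:

```go
// singleNUMANodeFit returns nil if every container's guaranteed requests can be
// satisfied from at least one common NUMA node, otherwise an Unschedulable status
// (per the review suggestion above to use Unschedulable instead of Error).
func singleNUMANodeFit(containers []v1.Container, numaMap []NUMANodeResource) *framework.Status {
	for _, container := range containers {
		bitmask := bm.NewFullBitMask() // assumed helper: start with all NUMA nodes as candidates
		for resource, quantity := range container.Resources.Requests {
			resourceBitmask := bm.NewEmptyBitMask()
			if guarantedQoS(&container.Resources.Limits, resource, quantity) {
				for numaIndex, numaNodeResources := range numaMap {
					// Mark the NUMA nodes that can satisfy this request on their own.
					if numaNodeResources.HasEnough(resource, quantity) { // assumed helper
						resourceBitmask.Add(numaIndex)
					}
				}
				if resourceBitmask.IsEmpty() {
					continue
				}
				bitmask.And(resourceBitmask)
			}
		}
		if bitmask.IsEmpty() {
			// No single NUMA node can hold all of this container's guaranteed
			// requests, so the pod cannot be aligned on this node.
			return framework.NewStatus(framework.Unschedulable,
				fmt.Sprintf("Can't align container: %s", container.Name))
		}
	}
	return nil
}
```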

of resources in that topology became actual. Pod could be scheduled on the node
where total amount of resources are enough, but resource distribution could not
satisfy the appropriate Topology policy. In this case the pod failed to start. Much
better behaviour for scheduler would be to select appropriate node where admit
Member

I suppose "admit handlers" is kubelet terms? If so, let's say:

Suggested change
better behaviour for scheduler would be to select appropriate node where admit
better behaviour for scheduler would be to select appropriate node where kubelet admit


## Goals

- Make scheduling process more precise when we have NUMA topology on the
Member

Remove extra spaces. (applies elsewhere)

Suggested change
- Make scheduling process more precise when we have NUMA topology on the
- Make scheduling process more precise when we have NUMA topology on the

Plugin checks the ability to run pod only in case of single-numa-node policy on the
node, since it is the most strict policy, it implies that the launch on the node with
other existing policies will be successful if the condition for single-numa-node policy passed for the worker node.
Proposed plugin will use node label to identify which topology policy is
Member

My understanding is that we prefer to use the CRD to describe whether single-numa-node has been enabled on the node, as well as the numaMap info, right? If so, let's update the wording here.

Contributor Author

right, will be updated.

if resourceBitmask.IsEmpty() {
continue
}
bitmask.And(resourceBitmask)
Member

Should we break below to return early? (Unless we prefer a full bitmask to log a more verbose scheduling failure)
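
The early exit suggested here would look something like this (a sketch; it trades the fully populated bitmask, and hence the more detailed failure message, for skipping the remaining resources once no common NUMA node is left):

```go
bitmask.And(resourceBitmask)
if bitmask.IsEmpty() {
	// No common NUMA node remains for this container's requests, so the
	// remaining resources cannot change the outcome: stop early.
	break
}
```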

information is necessary.

# Proposal
Kube-scheduler builtin plugin will be added to the main tree. This plugin
Member

Add a blank line above.

Member

IMO there are 2 parts on the scheduler side:

  • In-tree changes to internal data structures to accommodate the NUMAMap info, which will then be exposed by the scheduler framework handle (SnapshotSharedLister).
  • A new scheduler plugin that honors the NUMAMap info so that the scheduling decision is aligned with kubelet - we can discuss later whether we want to put it in-tree or out-of-tree. (A rough skeleton of such a plugin is sketched below.)
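
To make the second bullet concrete, a rough skeleton of such a filter plugin is sketched below. The plugin name, the TopologyLister accessor over the CRD, and the NodeResourceTopology type (sketched in an earlier thread) are assumptions, and the import path and signatures only approximate the scheduler framework of that era:

```go
package nodetopology

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// TopologyLister is an assumed accessor for the per-node topology CRD objects.
type TopologyLister interface {
	Get(nodeName string) (*NodeResourceTopology, error)
}

// NodeResourceTopologyMatch filters out nodes whose NUMA topology cannot satisfy
// the pod under the node's topology manager policy (sketch only).
type NodeResourceTopologyMatch struct {
	topologyLister TopologyLister
}

var _ framework.FilterPlugin = &NodeResourceTopologyMatch{}

// Name returns the plugin name used in the scheduler configuration.
func (p *NodeResourceTopologyMatch) Name() string { return "NodeResourceTopologyMatch" }

// Filter rejects the node if single-numa-node is enabled there and the pod's
// guaranteed requests cannot fit into one NUMA node.
func (p *NodeResourceTopologyMatch) Filter(ctx context.Context, state *framework.CycleState,
	pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {

	topo, err := p.topologyLister.Get(nodeInfo.Node().Name)
	if err != nil || topo == nil {
		// No topology object published for this node: nothing to check here.
		return nil
	}
	if topo.TopologyPolicy != "single-numa-node" {
		// Only the strictest policy is checked, as the KEP proposes.
		return nil
	}
	// Run the single-numa-node fit check (see the loop sketched earlier)
	// against topo.Zones and return Unschedulable if no alignment exists.
	return nil
}
```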

@@ -0,0 +1,177 @@
---
title: Deducted version of TopologyManager in kube-scheduler
Member

Is "Tailored" more proper? (Not an English expert though..)

Member

node-topology aware scheduling

Contributor Author

The meaning here was that we implement a reduced/simplified version of the topology manager.
Yes, maybe "Node-topology aware scheduling" is better - it's both specific and abstract.

@rphillips
Member

@vpickard could you comment on this?

@kad
Member

kad commented Jun 16, 2020

Sorry, but I'm opposing this approach. Here are the reasons why:

  1. Using only the NUMA id can't identify resources uniquely.
  2. It assumes the simplest server configuration: x86 + 2 sockets, uniform CPUs and memory.
  3. There are a lot of assumptions that NUMA nodes are equal: no distance/cost information, no linkage of memory-only nodes, heterogeneous memory types, ...

It is really not a good idea to start introducing hardware architecture specifics and assumptions into the scheduler.

@ffromani
Contributor

Sorry, but I'm opposing this approach. Here is reasons why:

1. Only using NUMA id can't identify resources uniquely.

2. assumptions from simplest server configuration x86 + 2 socket, uniform CPUs and memory

3. a lot of assumptions that NUMA nodes are equal: no distance/costs information, no linkage of memory-only nodes, heterogenous memory types, ...

It is really, not the good idea to start introducing hardware architecture specifics and assumptions into scheduler.

What could be a better alternative, in your opinion, to avoid the fundamental issue this KEP is trying to address, which is pods ending up in Topology Affinity Error, because the scheduler picks a node and then the Topology Manager in that node fails to properly align the requested resources?

@kad
Member

kad commented Jun 16, 2020

What could be a better alternative, in your opinion, to avoid the fundamental issue this KEP is trying to address, which is pods ending up in Topology Affinity Error, because the scheduler picks a node and then the Topology Manager in that node fails to properly align the requested resources?

It really depends on what we really want to solve. In my opinion, there is a need at the scheduler level to solve a fundamental error condition that can happen to any Pod, not only due to the Topology Manager:

  1. Precondition: node is in good healthy state, serving workloads
  2. Scheduler assigns Pod to the node
  3. Any of the errors related to particular Pod or to one or more containers within that Pod might occur:
    • requested Pod storage might have troubles to be attached to this node. (no affects to other pods)
    • CRI (container runtime) returned errors during RunPodSandbox or CreateContainer (again, those errors usually not affecting other workloads, because it can be caused by various error scenarios in Pod config or container config, like incorrect seccomp annotations or similar issues).
    • errors related to pulling images for this Pod (permission denied, timeouts to registry, decryption key unavailability on that node, and similar issues that are specific to Pod/Container, not globally for the whole node)
    • TopologyManager might not admit pod
    • network plugin failed for pod...
    • other potential single Pod or Container issues that don't indicate that whole node is unhealthy, but particular pod/container might not be run here.
  4. At this stage, if the pod is in error state and node is in healthy state, we have two options:
    1. keep trying to re-run this failed pod on the same node, indicating error states.
    2. a way to inform scheduler to re-schedule Pod somewhere else.
    3. combination: of two above: Scheduler can re-check availability of alternatives to run this pod, and if there are other candidate nodes in the list, re-schedule pod on one of those nodes. If there are no other node alternatives, keep it assigned in crashloop on current node.

Second problem, "aligning resources": this problem is complex, and we can't assume what the optimal "alignment" for a particular workload will be just because there is a more common pattern of building servers. For some workloads it will require a combination of CPUs / memory nodes and particular PCI devices aligned. For some, memory might not be critical, as the workload is not memory intensive and only CPU vs. PCI device matters. For some it might require memory from more than one region, etc... Exposing all of those conditions at the scheduler level is not that simple a task, and generally not needed for many k8s setups where nodes are not exposing real hardware topology. If we want to go down that path, there are some fundamental changes that need to be done in kubelet on how to properly expose and count resources (and collect information about their availability). There are some blueprints of that kind in the virtual kubelet project; it might require changes in the CRI APIs and on the runtimes side...

So, I'd really suggest focusing on the first fundamental issue mentioned above: "A Pod is scheduled on the node, and some (potentially) unrecoverable error for this particular pod occurs during creation/starting. How do we gracefully re-schedule it to another candidate node, or properly return an error to the user if we can't run this pod anywhere else in the cluster?"

@AlexeyPerevalov
Contributor Author

Sorry, but I'm opposing this approach. Here is reasons why:

  1. Only using NUMA id can't identify resources uniquely.
  2. assumptions from simplest server configuration x86 + 2 socket, uniform CPUs and memory
  3. a lot of assumptions that NUMA nodes are equal: no distance/costs information, no linkage of memory-only nodes, heterogenous memory types, ...

It is really, not the good idea to start introducing hardware architecture specifics and assumptions into scheduler.

If I understood all of Alexander's points correctly, to satisfy all the requirements we would need to move the TopologyManager/CPUManager/DeviceManager and so on into the scheduler or another dedicated daemon (OpenStack did that in Placement, but they didn't get there in one step), but it's not possible right now, for many reasons such as the high runtime complexity of the TopologyManager and the requirement to expose DevicePlugin data from the node.

@ffromani
Contributor

Thanks @kad for the deep and insightful response. A lot to unpack here. Now let me try to expand and clarify some points in order to (hopefully) make the further conversation easier.

What could be a better alternative, in your opinion, to avoid the fundamental issue this KEP is trying to address, which is pods ending up in Topology Affinity Error, because the scheduler picks a node and then the Topology Manager in that node fails to properly align the requested resources?

It really depends on what we really want to solve. In my opinion, there is need on scheduler level to solve fundamental error condition that can happen to any Pod, not only due to Topology Manager:

1. Precondition: node is in good healthy state, serving workloads

Absolutely

2. Scheduler assigns Pod to the node

3. Any of the errors related to particular Pod or to one or more containers within that Pod might occur:
   
   * requested Pod storage might have troubles to be attached to this node. (no affects to other pods)
   * CRI (container runtime) returned errors during `RunPodSandbox` or `CreateContainer` (again, those errors usually not affecting other workloads, because it can be caused by various error scenarios in Pod config or container config, like incorrect seccomp annotations or similar issues).
   * errors related to pulling images for this Pod (permission denied, timeouts to registry, decryption key unavailability on that node, and similar issues that are specific to Pod/Container, not globally for the whole node)
   * TopologyManager might not admit pod
   * network plugin failed for pod...
   * other potential single Pod or Container issues that don't indicate that whole node is unhealthy, but particular pod/container might not be run here.

4. At this stage, if the pod is in error state and node is in healthy state, we have two options:
   
   1. keep trying to re-run this failed pod on the same node, indicating error states.
   2. a way to inform scheduler to re-schedule Pod somewhere else.
   3. combination: of two above:  Scheduler can re-check availability of alternatives to run this pod, and if there are other candidate nodes in the list, re-schedule pod on one of those nodes. If there are no other node alternatives, keep it assigned in crashloop on current node.

This seems to suggest that we should not treat Topology Affinity Errors in a special way: they are yet another instance of node-related pod admission failures, no extra care needed in the scheduler.
Is this a fair summarization?

The approach described above solves the fundamental issue some of us (myself included!) have with the current behaviour of k8s, in which a pod rejected by the Topology Manager ends up in error/TopologyAffinityError. This is suboptimal and should be improved.

The drawback I can see in the approach you described above is that the pod can spend quite some time waiting to be scheduled, because

  1. the scheduler, having no knowledge about HW specifics, picks nodes randomly. Here "randomly" means that the selected node is not more likely than others to admit the pod. The scheduler just doesn't know.
  2. the randomly picked node rejects pod admission because topology manager said so.
  3. the scheduler somehow learns about the pod needs to be re-scheduled
  4. the scheduler picks another "random" node, and the cycle begins anew

So the issue is there is no guarantee whatsoever regarding how long it takes for a pod to actually get running, even if we know ahead of time that the cluster can actually run the pod. The pod can just be subject to an unlucky streak of random picks.

Second problem of "aligning resources": this problem is complex, and can't be assumed just because there is more common pattern of building servers or what actually for particular workload will be optimal "alignment". For some workloads it will require combination of CPUs / Memory Nodes and particular PCI devices aligned. For some memory might be not critical, as workload is not memory intensive and only CPU vs. PCI device matters. For some it might require memory from more than one region, etc... Exposing all of those conditions on scheduler level is not that simple task, and generally not needed for many of k8s setups where nodes are not exposing real hardware topology. If we want to go to that path, there are some fundamental changes that need to be done in kubelet on how to properly expose and count resources (and collect information about availability of such). There are some blueprints of that kind in virtual kubelet project, it might require changes in CRI apis and on runtimes side...

Makes sense to me. I can see why this problem is complex and hard. So, as a general direction, it seems better to keep the knowledge of the HW details/topology on the node, and not propagate these details into the cluster, right?

In other words, from a resource assignment perspective, if we keep this check in the kubelet, we will never know if a node can actually run a given pod (/container) unless we try to schedule on that specific node and we let the Topology Manager (or any other node component which knows about all these details) do this check.

This is not necessarily a problem, but if we all, as a community, agree that this is the right direction, then it becomes a pretty strong constraint we should all be well aware of, so I'm taking the chance to write it down explicitly :)

So, I'd really suggest to focus on first fundamental issue mentioned above: "Pod is scheduled on the node, some (potentially) unrecoverable error for this particular during creation/starting occur. How we gracefully re-schedule it somewhere to another candidate node or properly return error to user if we can't run this pod anywhere else in the cluster".

This is an approach we talked about (internally) a couple of months ago. Besides the unpredictable scheduling delay I described above, there's nothing really wrong here from my perspective, and it could be a nice first step.
But this approach requires (as you described above)

  1. a way to inform scheduler to re-schedule Pod somewhere else.
    and, to avoid looping on the same node,
  2. a way for the scheduler to remember somehow which nodes rejected the pod, in order to be able to try new nodes

Does this look right?
In this framework, my understanding is that the basic concept behind this KEP was to move a step further, avoiding the scheduler retry loop and trying to pick the right node (= a node whose Topology Manager is most likely to admit the pod) already on the first try.

@Huang-Wei
Member

Huang-Wei commented Jun 19, 2020

In general, what we have discussed here falls into 2 directions:

  • Use scheduler to do the topology-related computation, although it's sort of a simplified version - this is what this KEP suggests.
  • As @kad mentioned, "schedule-propose-but-kubelet-reject" failure is a general problem, putting too much HW specific calculation in scheduler doesn't look quite good. Probably we should come up with more general mechanics to learn from this "schedule-propose-but-kubelet-reject" failure, and then hopefully the pod will end up landing on an admitted node.

A 3rd option comes to my mind which combines the above 2. Here is a very rough idea: the scheduler doesn't add in topology-related computation; instead, it suggests more than 1 node (if there are any) for pod placement by modifying the pod's .status.nominatedNodeName field (we may need a new field) to notify the nominated nodes to run further checks to see whether they can admit this pod. If more than one node can accommodate this pod, they compete to run it by updating .spec.nodeName to themselves, and only the first accommodation request will succeed; the other ones fail due to APIResourceConflict errors.

BTW: to avoid compute overhead on every potential node, the suggested number should be configurable, e.g. defaulting to 5.

@AlexeyPerevalov
Contributor Author

In general, what we have discussed here falls into 2 directions:

  • Use scheduler to do the topology-related computation, although it's sort of a simplified version - this is what this KEP suggests.
  • As @kad mentioned, "schedule-propose-but-kubelet-reject" failure is a general problem, putting too much HW specific calculation in scheduler doesn't look quite good. Probably we should come up with more general mechanics to learn from this "schedule-propose-but-kubelet-reject" failure, and then hopefully the pod will end up landing on an admitted node.

A 3rd option comes into my mind which combines the above 2 ones. Here is a very rough idea: scheduler doesn't add in topology-related computation; instead, it suggests more than 1 node (if there are) for pod placement by modifying the node's .status.nominatedNodeName field (we may need a new field) to notify nominated nodes to take further checks to see they can admit this pod. If more than one node can accommodate this pod, they compete for running this pod by updating .spec.nodeName to itself. And only the first accommodation request will succeed, the other ones failed due to APIResourceConflict errors.

I have a question regarding 3rd option:

  1. "scheduler doesn't add in topology-related computation; instead, it suggests more than 1 node" - these nodes are the nominated nodes, so the way we find them is based just on pkg/scheduler/framework/plugins/noderesources/fit.go?
  2. "notify nominated nodes" - does it mean we call the kubelet admit handler from kube-scheduler, or is it yet another new (additional and optional) stage of the scheduling process?

BTW: to avoid compute overhead on every potential node, the suggested number should be configured, such as defaulting to 5.

@ffromani
Contributor

In general, what we have discussed here falls into 2 directions:
[...]
* As @kad mentioned, "schedule-propose-but-kubelet-reject" failure is a general problem, putting too much HW specific calculation in scheduler doesn't look quite good. Probably we should come up with more general mechanics to learn from this "schedule-propose-but-kubelet-reject" failure, and then hopefully the pod will end up landing on an admitted node.

There was some initial talk about this approach. The initial idea was something along these lines. The biggest problem was the existing controllers, like in this scenario:

  1. a Pod gets scheduled, lands on a node, fails admission by the Topology Manager, and kubelet marks it as failed with TopologyAffinityError
  2. a controller sees the Pod failed and reschedules it, but the pod still fails admission, so failed pods pile up really fast

I think @vpickard and @swatisehgal can add more details about this scenario.
So we likely need a way to handle this scenario, possibly without changing all the existing controllers.

A 3rd option comes into my mind which combines the above 2 ones. Here is a very rough idea: scheduler doesn't add in topology-related computation; instead, it suggests more than 1 node (if there are) for pod placement by modifying the node's .status.nominatedNodeName field (we may need a new field) to notify nominated nodes to take further checks to see they can admit this pod. If more than one node can accommodate this pod, they compete for running this pod by updating .spec.nodeName to itself. And only the first accommodation request will succeed, the other ones failed due to APIResourceConflict errors.

BTW: to avoid compute overhead on every potential node, the suggested number should be configured, such as defaulting to 5.

Looks neat! However, how should the competition be regulated? IOW, how does this model interact with configured replicas? Let's say I run a deployment which wants exactly one of my pods; I think it may very well happen in this scenario that, say, three nodes try to run my pod, only one wins (obviously) and keeps the pod going, and the losers notice and silently kill their pods. Is my understanding right?

@Huang-Wei
Member

Re @AlexeyPerevalov:

  1. scheduler doesn't add in topology-related computation; instead, it suggests more than 1 node - these more than 1 node is nominated nodes, so the way we found it is based just on pkg/scheduler/framework/plugins/noderesources/fit.go

Yes, as well as other existing Filtering constraints.

  1. notify nominated nodes - does it mean we call kubelet admit handler from kube-scheduler or it's yet another new (additional and optional) stage of scheduling process?

Nope, there is still no interaction between kube-scheduler and kubelet. Here is how it works: today only one kubelet "claims" the pod (with .spec.nodeName set by the scheduler), so it's likely that the admit handler may fail it. However, with this approach multiple kubelets "pre-claim" the pod, and hopefully at least one can admit it; several kubelets then compete to "claim" the pod.

Re @fromanirh, I understand both options 2 & 3 are very rough ideas, but it would be good to brainstorm here.

Looks neat! However, how should the competition be regulated? IOW, how this model interacts with configured replicas? Let's say I run a deployment which wants exactly one of my pod, I think it may very well happen in this scenario that, say, three nodes try to run my pod, only one wins (obviously) and keep the pod going, and the losers notice and silently kill their pods. Is my understanding right?

Not really. There is always one pod for each replica. It's just that the logic changes from "one kubelet admits and runs the pod" to "N kubelets admit and compete to run the pod". So there is no concept of "kill their pod", as there is only one pod. Only the one winning kubelet will be able to change the pod's ".spec.nodeName" to its node name, that's it.

@ahg-g
Member

ahg-g commented Jun 19, 2020

The problem with providing multiple node nominations is that it still doesn't guarantee that a node will admit the pod; moreover, we get into a tuning issue where we need to select how many nodes to nominate, which will likely differ by cluster size, current cluster utilization, etc.

@AlexeyPerevalov
Contributor Author

  1. notify nominated nodes - does it mean we call kubelet admit handler from kube-scheduler or it's yet another new (additional and optional) stage of scheduling process?

Nope, there is still no interaction between kube-scheduler and kuebelet. Here is how it works: now only one kubelet "claims" the pod (with .spec.nodName set by scheduler), so it's likely that admit handler may fail it. However, if multiple kubelets "pre-claim" the pod, and hopefully at least one can admit the pod, or several kubelets competes in "claim"ing the pod.

It looks like a new stage of scheduling. kube-scheduler says these nodes are nominated, and the nodes run their admit handlers; if it passes, the kubelet on the node changes e.g. the pod's status to ReadyToRunOnTheNode or AdmitHandlerPassed, and kube-scheduler tracks this and changes the pod's spec.nodeName to the appropriate node.

But such behavior is necessary only for nodes with the TopologyManager enabled, and only for such nodes will scheduling take a little bit longer.

@AlexeyPerevalov
Contributor Author

The problem with providing multiple node nominations is that it still doesn't guarantee that a node will admit the pod, moreover we get into a tuning issue where we need to select how many nodes to nominate, which will likely differ by cluster size, current cluster utilization etc.

I think the maximum number of nominated nodes could be defined in configuration or, yes, evaluated dynamically.

@AlexeyPerevalov AlexeyPerevalov changed the title Deducted version of topology manager in kube-scheduler Simplified version of topology manager in kube-scheduler Jun 21, 2020
@AlexeyPerevalov AlexeyPerevalov marked this pull request as ready for review November 6, 2020 08:12
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 6, 2020
swatisehgal added a commit to swatisehgal/topology-aware-scheduler-plugin that referenced this pull request Nov 20, 2020
As part of the enablement of Topology-aware scheduling in kubernetes,
the scheduler plugin has been proposed as in-tree.

KEP: kubernetes/enhancements#1858
PR: kubernetes/kubernetes#90708

To enable faster development velocity, testing and community adoption
we are also packaging as an out of tree scheduler plugin. This out-of-tree
implementation is based on the above PR and KEP.

@AlexeyPerevalov
Contributor Author

We decided to focus on the out-of-tree plugin kubernetes-sigs/scheduler-plugins#119. This KEP will be postponed until noderesourcetopology-api is in staging or built-in.
/hold
Or maybe close (and reopen when the use case of private clouds on bare-metal on-premise is more widespread).

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 25, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 25, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 25, 2021
@AlexeyPerevalov
Contributor Author

/close
in favor of kubernetes-sigs/scheduler-plugins#119

@k8s-ci-robot
Contributor

@AlexeyPerevalov: Closed this PR.

In response to this:

/close
in favor of kubernetes-sigs/scheduler-plugins#119

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

swatisehgal added a commit to swatisehgal/enhancements that referenced this pull request Jun 10, 2021
- This KEP consolidates the following two KEPs into one
  - kubernetes#1858
  - kubernetes#1870

- Also the KEP talks about introducing NodeResourceTopology
   as a native Kubernetes resource.

Co-authored-by: Alexey Perevalov <alexey.perevalov@huawei.com>
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>

Successfully merging this pull request may close these issues.

scheduler being topology-unaware can cause runaway pod creation