
Simplified version of topology manager in kube-scheduler #1858

Conversation

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 12, 2020
@k8s-ci-robot
Contributor

Welcome @AlexeyPerevalov!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 12, 2020
@k8s-ci-robot
Contributor

Hi @AlexeyPerevalov. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Jun 12, 2020
@k8s-ci-robot k8s-ci-robot requested a review from ahg-g June 12, 2020 08:33
@k8s-ci-robot k8s-ci-robot added the sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. label Jun 12, 2020
@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Jun 12, 2020
- "@AlexeyPerevalov"
owning-sig: sig-scheduling
participating-sigs:
reviewers:
Member

please add me as reviewer and approver; we also need a reviewer from a sig-node lead.


## Non-Goals

- Do not change Topology Manager behaviour to be able to work with policy in
Member

I guess you meant to say that changing Topology Manager behavior is a non-goal, so the phrasing should be like this:

Suggested change
- Do not change Topology Manager behaviour to be able to work with policy in
- Change Topology Manager behaviour to be able to work with policy in

Member

would this be more descriptive of this non-goal: "Change the PodSpec to allow requesting a specific node topology manager policy"?

Contributor Author

yes, it would be. Thank you.

}
if bitmask.IsEmpty() {
// we can't align container, so we can't align a pod
return framework.NewStatus(framework.Error, fmt.Sprintf("Can't align container: %s", container.Name))
Member

Suggested change
return framework.NewStatus(framework.Error, fmt.Sprintf("Can't align container: %s", container.Name))
return framework.NewStatus(framework.Unschedulable, fmt.Sprintf("Can't align container: %s", container.Name))

for resource, quantity := range container.Resources.Requests {
resourceBitmask := bm.NewEmptyBitMask()
if guarantedQoS(&container.Resources.Limits, resource, quantity) {
for numaIndex, numaNodeResources := range numaMap {
Member

can we change this to match what you are proposing to add to NodeInfo (lines 100-105): a list of NUMANodeResources, and the index is NUMAID

Contributor Author

yes, sure, it would be better. It was intermediate map, since I tried several approaches, like prefixes with numa%d in ResourceName. Now it's not necessary, especially it may confuse here in proposal.

Comment on lines 86 to 87
Available resources with topology of the node should be stored in CRD. Format of the topology described
[in this document](https://docs.google.com/document/d/12kj3fK8boNuPNqob6F_pPU9ZTaNEnPGaXEooW1Cilwg/edit).
Member

who is going to maintain this CRD? sig-node?

Contributor Author

The CRD was suggested as one of the approaches during discussion in a sig-node meeting, and interested people (from Red Hat) were involved in the discussion of the CRD format. Since we plan to create/update the CRD not directly from kubelet but from a separate daemon (implemented as a DaemonSet), I think its authors, including me, will maintain the CRD too.

Member

Evolving the API spec from a CRD is a good starting point, at some time when the spec is mature, we can merge it back to a core API. RuntimeClass is a good example - it incubated as a CRD in alpha phase, and then merged back into upstream in the beta phase.
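
As a purely illustrative sketch (the actual format is the one in the linked document, and every name below is an assumption), such a CRD could expose the kubelet topology policy together with per-NUMA-zone allocatable resources, which would also make the node label discussed further down unnecessary:

```go
package v1alpha1 // hypothetical API group/version for the sketch

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// NodeResourceTopology is a per-node object, published by the node agent
// (DaemonSet) mentioned above, that a scheduler plugin could read.
type NodeResourceTopology struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// TopologyPolicy mirrors the kubelet topology manager policy,
	// e.g. "single-numa-node", so a separate node label is not needed.
	TopologyPolicy string `json:"topologyPolicy"`

	// Zones lists the NUMA nodes and the resources still allocatable on each.
	Zones []NUMAZone `json:"zones"`
}

// NUMAZone describes one NUMA node of the worker node.
type NUMAZone struct {
	NUMAID      int             `json:"numaID"`
	Allocatable v1.ResourceList `json:"allocatable"`
}
```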

Node label contains the name of the topology policy currently implemented in kubelet.

Proposed Node Label may look like this:
`beta.kubernetes.io/topology=none|best-effort|restricted|single-numa-node`
Member

this will be confusing, we already have topology.kubernetes.io: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/api/core/v1/well_known_labels.go#L22:1

We should run this by sig API Machinery to define a label.

Member

also, why do we need this label at all? can't we have the policy in the CRD?

Contributor Author

The CRD describes the node, so yes, it's a good idea to keep it there.

Member

+1 to make this info described by CRD.

Comment on lines 57 to 59
- This Proposal requires exposing NUMA topology information. This KEP doesn't
describe how to expose all necessary information it just declare what kind of
information is necessary.
Member

but this is a blocker to having the scheduler work done, so I think we need both KEPs approved at the same time.

Contributor Author

We tried several approaches to export topology from the worker node, some of which required kubelet modification, and we came to the conclusion that it's better to avoid modifying kubelet. There are two feasible approaches: collecting resources at the CRI level, and using the kubelet podresources interface (a unix domain socket, though currently it doesn't provide cpumanager information, only device plugin resources).
Here I agree, we need a detailed description of how it would be implemented. Probably that KEP should be in sig-node, but as I mentioned before, the implementation should not touch kubelet.
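
For reference, a minimal sketch of how such a daemon might read the kubelet podresources endpoint is below. The socket path and the v1alpha1 import path are assumptions for the sketch, and, as noted above, at the time this endpoint only reported device plugin resources, not cpumanager assignments:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1alpha1"
)

func main() {
	socket := "/var/lib/kubelet/pod-resources/kubelet.sock" // assumed default path
	conn, err := grpc.Dial(socket, grpc.WithInsecure(),
		grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
			return (&net.Dialer{}).DialContext(ctx, "unix", addr)
		}))
	if err != nil {
		log.Fatalf("cannot connect to podresources socket: %v", err)
	}
	defer conn.Close()

	client := podresourcesapi.NewPodResourcesListerClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	resp, err := client.List(ctx, &podresourcesapi.ListPodResourcesRequest{})
	if err != nil {
		log.Fatalf("podresources List failed: %v", err)
	}
	// Device plugin resources are reported per container with their device IDs;
	// a node agent could map device IDs to NUMA nodes and publish that via the CRD.
	for _, pod := range resp.GetPodResources() {
		for _, c := range pod.GetContainers() {
			for _, dev := range c.GetDevices() {
				fmt.Printf("%s/%s %s: %s -> %v\n",
					pod.GetNamespace(), pod.GetName(), c.GetName(),
					dev.GetResourceName(), dev.GetDeviceIds())
			}
		}
	}
}
```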

The algorithm which implements single-numa-node policy is following:

```go
for _, container := range containers {
Member

I guess this code is also executed by kubelet/topologyManager, note that the scheduler can't take a dependency on kubelet, and so I suggest this logic be extracted into a pkg in staging that both kubelet and the scheduler import.

Contributor Author

The original logic of the TopologyManager has high runtime complexity; that's why a simplified version is proposed here (a rough reconstruction of the loop is sketched below). Maybe the best way would be to move the whole TopologyManager to staging, but reusing it as-is is impossible.
When I started this task, I thought about moving the whole logic of the TopologyManager into kube-scheduler, but that requires:

  1. factoring out the TopologyManager entirely, as well as the dependent managers (CPUManager/DeviceManager)
  2. moving the CPUManager/DeviceManager too, which requires substantial changes in kubelet's API and is probably impossible right now.
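
Putting the snippets quoted throughout this review back together, the simplified single-numa-node check presumably looks roughly like the function below. This is a reconstruction, not the KEP's exact text: guarantedQoS, the bm bitmask calls, and the framework/fmt aliases are taken from the quoted snippets, while numaMap's element type and the NewFullBitMask/HasEnough helpers are assumed:

```go
// singleNUMANodeFit returns nil if every container's guaranteed requests can be
// satisfied from at least one common NUMA node, otherwise an Unschedulable status
// (per the review suggestion above to use Unschedulable instead of Error).
func singleNUMANodeFit(containers []v1.Container, numaMap []NUMANodeResource) *framework.Status {
	for _, container := range containers {
		bitmask := bm.NewFullBitMask() // assumed helper: start with all NUMA nodes as candidates
		for resource, quantity := range container.Resources.Requests {
			resourceBitmask := bm.NewEmptyBitMask()
			if guarantedQoS(&container.Resources.Limits, resource, quantity) {
				for numaIndex, numaNodeResources := range numaMap {
					// Mark the NUMA nodes that can satisfy this request on their own.
					if numaNodeResources.HasEnough(resource, quantity) { // assumed helper
						resourceBitmask.Add(numaIndex)
					}
				}
				if resourceBitmask.IsEmpty() {
					continue
				}
				bitmask.And(resourceBitmask)
			}
		}
		if bitmask.IsEmpty() {
			// No single NUMA node can hold all of this container's guaranteed
			// requests, so the pod cannot be aligned on this node.
			return framework.NewStatus(framework.Unschedulable,
				fmt.Sprintf("Can't align container: %s", container.Name))
		}
	}
	return nil
}
```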

of resources in that topology became actual. Pod could be scheduled on the node
where total amount of resources are enough, but resource distribution could not
satisfy the appropriate Topology policy. In this case the pod failed to start. Much
better behaviour for scheduler would be to select appropriate node where admit
Member

I suppose "admit handlers" is kubelet terms? If so, let's say:

Suggested change
better behaviour for scheduler would be to select appropriate node where admit
better behaviour for scheduler would be to select appropriate node where kubelet admit


## Goals

- Make scheduling process more precise when we have NUMA topology on the
Member

Remove extra spaces. (applies elsewhere)

Suggested change
- Make scheduling process more precise when we have NUMA topology on the
- Make scheduling process more precise when we have NUMA topology on the

Plugin checks the ability to run pod only in case of single-numa-node policy on the
node, since it is the most strict policy, it implies that the launch on the node with
other existing policies will be successful if the condition for single-numa-node policy passed for the worker node.
Proposed plugin will use node label to identify which topology policy is
Member

My understanding is that we prefer to use the CRD to describe whether single-numa-node has been enabled on the node, as well as the numaMap info, right? If so, let's update the wording here.

Contributor Author

right, will be updated.

if resourceBitmask.IsEmpty() {
continue
}
bitmask.And(resourceBitmask)
Member

Should we break below to return early? (Unless we prefer a full bitmask to log a more verbose scheduling failure)
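
The early exit suggested here would look something like this (a sketch; it trades the fully populated bitmask, and hence the more detailed failure message, for skipping the remaining resources once no common NUMA node is left):

```go
bitmask.And(resourceBitmask)
if bitmask.IsEmpty() {
	// No common NUMA node remains for this container's requests, so the
	// remaining resources cannot change the outcome: stop early.
	break
}
```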

information is necessary.

# Proposal
Kube-scheduler builtin plugin will be added to the main tree. This plugin
Member

Add a blank line above.

Member

IMO there are 2 parts on the scheduler side:

  • In-tree changes to internal data structures to accommodate the NUMAMap info, which will then be exposed by the scheduler framework handle (SnapshotSharedLister).
  • A new scheduler plugin that honors the NUMAMap info so that the scheduling decision is aligned with kubelet - we can discuss later whether we want to put it in-tree or out-of-tree. (A rough skeleton of such a plugin is sketched below.)
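
To make the second bullet concrete, a rough skeleton of such a filter plugin is sketched below. The plugin name, the TopologyLister accessor over the CRD, and the NodeResourceTopology type (sketched in an earlier thread) are assumptions, and the import path and signatures only approximate the scheduler framework of that era:

```go
package nodetopology

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// TopologyLister is an assumed accessor for the per-node topology CRD objects.
type TopologyLister interface {
	Get(nodeName string) (*NodeResourceTopology, error)
}

// NodeResourceTopologyMatch filters out nodes whose NUMA topology cannot satisfy
// the pod under the node's topology manager policy (sketch only).
type NodeResourceTopologyMatch struct {
	topologyLister TopologyLister
}

var _ framework.FilterPlugin = &NodeResourceTopologyMatch{}

// Name returns the plugin name used in the scheduler configuration.
func (p *NodeResourceTopologyMatch) Name() string { return "NodeResourceTopologyMatch" }

// Filter rejects the node if single-numa-node is enabled there and the pod's
// guaranteed requests cannot fit into one NUMA node.
func (p *NodeResourceTopologyMatch) Filter(ctx context.Context, state *framework.CycleState,
	pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {

	topo, err := p.topologyLister.Get(nodeInfo.Node().Name)
	if err != nil || topo == nil {
		// No topology object published for this node: nothing to check here.
		return nil
	}
	if topo.TopologyPolicy != "single-numa-node" {
		// Only the strictest policy is checked, as the KEP proposes.
		return nil
	}
	// Run the single-numa-node fit check (see the loop sketched earlier)
	// against topo.Zones and return Unschedulable if no alignment exists.
	return nil
}
```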

@@ -0,0 +1,177 @@
---
title: Deducted version of TopologyManager in kube-scheduler
Member

Is "Tailored" more proper? (Not an English expert though..)

Member

node-topology aware scheduling

Contributor Author

The meaning here was that we implement a reduced/simplified version of the topology manager.
Yes, maybe "Node-topology aware scheduling" is better - it's both specific and abstract.

@rphillips
Member

@vpickard could you comment on this?

@kad
Member

kad commented Jun 16, 2020

Sorry, but I'm opposing this approach. Here are the reasons why:

  1. Using only the NUMA id can't identify resources uniquely.
  2. It assumes the simplest server configuration: x86 + 2 sockets, uniform CPUs and memory.
  3. There are a lot of assumptions that NUMA nodes are equal: no distance/cost information, no linkage of memory-only nodes, heterogeneous memory types, ...

It is really not a good idea to start introducing hardware architecture specifics and assumptions into the scheduler.

@ffromani
Contributor

Sorry, but I'm opposing this approach. Here is reasons why:

1. Only using NUMA id can't identify resources uniquely.

2. assumptions from simplest server configuration x86 + 2 socket, uniform CPUs and memory

3. a lot of assumptions that NUMA nodes are equal: no distance/costs information, no linkage of memory-only nodes, heterogenous memory types, ...

It is really, not the good idea to start introducing hardware architecture specifics and assumptions into scheduler.

What could be a better alternative, in your opinion, to avoid the fundamental issue this KEP is trying to address, which is pods ending up in Topology Affinity Error, because the scheduler picks a node and then the Topology Manager in that node fails to properly align the requested resources?

@kad
Member

kad commented Jun 16, 2020

What could be a better alternative, in your opinion, to avoid the fundamental issue this KEP is trying to address, which is pods ending up in Topology Affinity Error, because the scheduler picks a node and then the Topology Manager in that node fails to properly align the requested resources?

It really depends on what we really want to solve. In my opinion, there is a need at the scheduler level to solve a fundamental error condition that can happen to any Pod, not only due to the Topology Manager:

  1. Precondition: node is in good healthy state, serving workloads
  2. Scheduler assigns Pod to the node
  3. Any of the errors related to particular Pod or to one or more containers within that Pod might occur:
    • requested Pod storage might have troubles to be attached to this node. (no affects to other pods)
    • CRI (container runtime) returned errors during RunPodSandbox or CreateContainer (again, those errors usually not affecting other workloads, because it can be caused by various error scenarios in Pod config or container config, like incorrect seccomp annotations or similar issues).
    • errors related to pulling images for this Pod (permission denied, timeouts to registry, decryption key unavailability on that node, and similar issues that are specific to Pod/Container, not globally for the whole node)
    • TopologyManager might not admit pod
    • network plugin failed for pod...
    • other potential single Pod or Container issues that don't indicate that whole node is unhealthy, but particular pod/container might not be run here.
  4. At this stage, if the pod is in error state and node is in healthy state, we have two options:
    1. keep trying to re-run this failed pod on the same node, indicating error states.
    2. a way to inform scheduler to re-schedule Pod somewhere else.
    3. combination: of two above: Scheduler can re-check availability of alternatives to run this pod, and if there are other candidate nodes in the list, re-schedule pod on one of those nodes. If there are no other node alternatives, keep it assigned in crashloop on current node.

Second problem, "aligning resources": this problem is complex, and we can't assume what the optimal "alignment" for a particular workload will be just because there is a more common pattern of building servers. For some workloads it will require a combination of CPUs / memory nodes and particular PCI devices aligned. For some, memory might not be critical, as the workload is not memory intensive and only CPU vs. PCI device matters. For some it might require memory from more than one region, etc... Exposing all of those conditions at the scheduler level is not that simple a task, and generally not needed for many k8s setups where nodes are not exposing real hardware topology. If we want to go down that path, there are some fundamental changes that need to be done in kubelet on how to properly expose and count resources (and collect information about their availability). There are some blueprints of that kind in the virtual kubelet project; it might require changes in the CRI APIs and on the runtimes side...

So, I'd really suggest focusing on the first fundamental issue mentioned above: "A Pod is scheduled on the node, and some (potentially) unrecoverable error for this particular pod occurs during creation/starting. How do we gracefully re-schedule it to another candidate node, or properly return an error to the user if we can't run this pod anywhere else in the cluster?"

@AlexeyPerevalov
Contributor Author

Sorry, but I'm opposing this approach. Here is reasons why:

  1. Only using NUMA id can't identify resources uniquely.
  2. assumptions from simplest server configuration x86 + 2 socket, uniform CPUs and memory
  3. a lot of assumptions that NUMA nodes are equal: no distance/costs information, no linkage of memory-only nodes, heterogenous memory types, ...

It is really, not the good idea to start introducing hardware architecture specifics and assumptions into scheduler.

If I understood all of Alexander's points correctly, to satisfy all the requirements we would need to move the TopologyManager/CPUManager/DeviceManager and so on into the scheduler or another dedicated daemon (OpenStack did that in Placement, but they didn't get there in one step), but it's not possible right now, for many reasons such as the high runtime complexity of the TopologyManager and the requirement to expose DevicePlugin data from the node.

@ffromani
Contributor

Thanks @kad for the deep and insightful response. A lot to unpack here. Now let me try to expand and clarify some points in order to (hopefully) make the further conversation easier.

What could be a better alternative, in your opinion, to avoid the fundamental issue this KEP is trying to address, which is pods ending up in Topology Affinity Error, because the scheduler picks a node and then the Topology Manager in that node fails to properly align the requested resources?

It really depends on what we really want to solve. In my opinion, there is need on scheduler level to solve fundamental error condition that can happen to any Pod, not only due to Topology Manager:

1. Precondition: node is in good healthy state, serving workloads

Absolutely

2. Scheduler assigns Pod to the node

3. Any of the errors related to particular Pod or to one or more containers within that Pod might occur:
   
   * requested Pod storage might have troubles to be attached to this node. (no affects to other pods)
   * CRI (container runtime) returned errors during `RunPodSandbox` or `CreateContainer` (again, those errors usually not affecting other workloads, because it can be caused by various error scenarios in Pod config or container config, like incorrect seccomp annotations or similar issues).
   * errors related to pulling images for this Pod (permission denied, timeouts to registry, decryption key unavailability on that node, and similar issues that are specific to Pod/Container, not globally for the whole node)
   * TopologyManager might not admit pod
   * network plugin failed for pod...
   * other potential single Pod or Container issues that don't indicate that whole node is unhealthy, but particular pod/container might not be run here.

4. At this stage, if the pod is in error state and node is in healthy state, we have two options:
   
   1. keep trying to re-run this failed pod on the same node, indicating error states.
   2. a way to inform scheduler to re-schedule Pod somewhere else.
   3. combination: of two above:  Scheduler can re-check availability of alternatives to run this pod, and if there are other candidate nodes in the list, re-schedule pod on one of those nodes. If there are no other node alternatives, keep it assigned in crashloop on current node.

This seems to suggest that we should not treat Topology Affinity Errors in a special way: they are yet another instance of node-related pod admission failures, no extra care needed in the scheduler.
Is this a fair summarization?

The approach described above solves the fundamental issue some of us (myself included!) have with the current behaviour of k8s, in which a pod rejected by the Topology Manager ends up in error/TopologyAffinityError. This is suboptimal and should be improved.

The drawback I can see in the approach you described above is that the pod can spend quite some time waiting to be scheduled, because

  1. the scheduler, having no knowledge about HW specifics, picks nodes randomly. Here "randomly" means that the selected node is not more likely than others to admit the pod. The scheduler just doesn't know.
  2. the randomly picked node rejects pod admission because topology manager said so.
  3. the scheduler somehow learns about the pod needs to be re-scheduled
  4. the scheduler picks another "random" node, and the cycle begins anew

So the issue is there is no guarantee whatsoever regarding how long it takes for a pod to actually get running, even if we know ahead of time that the cluster can actually run the pod. The pod can just be subject to an unlucky streak of random picks.

Second problem of "aligning resources": this problem is complex, and can't be assumed just because there is more common pattern of building servers or what actually for particular workload will be optimal "alignment". For some workloads it will require combination of CPUs / Memory Nodes and particular PCI devices aligned. For some memory might be not critical, as workload is not memory intensive and only CPU vs. PCI device matters. For some it might require memory from more than one region, etc... Exposing all of those conditions on scheduler level is not that simple task, and generally not needed for many of k8s setups where nodes are not exposing real hardware topology. If we want to go to that path, there are some fundamental changes that need to be done in kubelet on how to properly expose and count resources (and collect information about availability of such). There are some blueprints of that kind in virtual kubelet project, it might require changes in CRI apis and on runtimes side...

Makes sense to me. I can see why this problem is complex and hard. So, as a general direction, it seems better to keep the knowledge of the HW details/topology on the node, and not propagate these details into the cluster, right?

In other words, from a resource assignment perspective, if we keep this check in the kubelet, we will never know if a node can actually run a given pod (/container) unless we try to schedule on that specific node and we let the Topology Manager (or any other node component which knows about all these details) do this check.

This is not necessarily a problem, but if we all, as a community, agree that this is the right direction, then it becomes a pretty strong constraint we should all be well aware of, so I'm taking the chance to write it down explicitly :)

So, I'd really suggest to focus on first fundamental issue mentioned above: "Pod is scheduled on the node, some (potentially) unrecoverable error for this particular during creation/starting occur. How we gracefully re-schedule it somewhere to another candidate node or properly return error to user if we can't run this pod anywhere else in the cluster".

This is an approach we talked about (internally) a couple of months ago. Besides the unpredictable scheduling delay I described above, there's nothing really wrong here from my perspective, and it could be a nice first step.
But this approach requires (as you described above)

  1. a way to inform scheduler to re-schedule Pod somewhere else.
    and, to avoid looping on the same node,
  2. a way for the scheduler to remember somehow which nodes rejected the pod, in order to be able to try new nodes

Does this look right?
In this framework, my understanding is that the basic concept behind this KEP was to move a step further, avoiding the scheduler retry loop and trying to pick the right node (= a node whose Topology Manager is most likely to admit the pod) already on the first try.

@Huang-Wei
Member

Huang-Wei commented Jun 19, 2020

In general, what we have discussed here falls into 2 directions:

  • Use scheduler to do the topology-related computation, although it's sort of a simplified version - this is what this KEP suggests.
  • As @kad mentioned, "schedule-propose-but-kubelet-reject" failure is a general problem, putting too much HW specific calculation in scheduler doesn't look quite good. Probably we should come up with more general mechanics to learn from this "schedule-propose-but-kubelet-reject" failure, and then hopefully the pod will end up landing on an admitted node.

A 3rd option comes to my mind which combines the above 2. Here is a very rough idea: the scheduler doesn't add in topology-related computation; instead, it suggests more than 1 node (if there are any) for pod placement by modifying the pod's .status.nominatedNodeName field (we may need a new field) to notify the nominated nodes to run further checks to see whether they can admit this pod. If more than one node can accommodate this pod, they compete to run it by updating .spec.nodeName to themselves, and only the first accommodation request will succeed; the other ones fail due to APIResourceConflict errors.

BTW: to avoid compute overhead on every potential node, the suggested number should be configurable, e.g. defaulting to 5.

@AlexeyPerevalov
Contributor Author

In general, what we have discussed here falls into 2 directions:

  • Use scheduler to do the topology-related computation, although it's sort of a simplified version - this is what this KEP suggests.
  • As @kad mentioned, "schedule-propose-but-kubelet-reject" failure is a general problem, putting too much HW specific calculation in scheduler doesn't look quite good. Probably we should come up with more general mechanics to learn from this "schedule-propose-but-kubelet-reject" failure, and then hopefully the pod will end up landing on an admitted node.

A 3rd option comes into my mind which combines the above 2 ones. Here is a very rough idea: scheduler doesn't add in topology-related computation; instead, it suggests more than 1 node (if there are) for pod placement by modifying the node's .status.nominatedNodeName field (we may need a new field) to notify nominated nodes to take further checks to see they can admit this pod. If more than one node can accommodate this pod, they compete for running this pod by updating .spec.nodeName to itself. And only the first accommodation request will succeed, the other ones failed due to APIResourceConflict errors.

I have a question regarding 3rd option:

  1. "scheduler doesn't add in topology-related computation; instead, it suggests more than 1 node" - these nodes are the nominated nodes, so the way we find them is based just on pkg/scheduler/framework/plugins/noderesources/fit.go?
  2. "notify nominated nodes" - does it mean we call the kubelet admit handler from kube-scheduler, or is it yet another new (additional and optional) stage of the scheduling process?

BTW: to avoid compute overhead on every potential node, the suggested number should be configured, such as defaulting to 5.

@ffromani
Contributor

In general, what we have discussed here falls into 2 directions:
[...]
* As @kad mentioned, "schedule-propose-but-kubelet-reject" failure is a general problem, putting too much HW specific calculation in scheduler doesn't look quite good. Probably we should come up with more general mechanics to learn from this "schedule-propose-but-kubelet-reject" failure, and then hopefully the pod will end up landing on an admitted node.

There was some initial talk about this approach. The initial idea was something along these lines. The biggest problem was the existing controllers, like in this scenario:

  1. a Pod gets scheduled, lands on a node, fails admission by the Topology Manager, and kubelet marks it as failed with TopologyAffinityError
  2. a controller sees the Pod failed and reschedules it, but the pod still fails admission, so failed pods pile up really fast

I think @vpickard and @swatisehgal can add more details about this scenario.
So we likely need a way to handle this scenario, possibly without changing all the existing controllers.

A 3rd option comes into my mind which combines the above 2 ones. Here is a very rough idea: scheduler doesn't add in topology-related computation; instead, it suggests more than 1 node (if there are) for pod placement by modifying the node's .status.nominatedNodeName field (we may need a new field) to notify nominated nodes to take further checks to see they can admit this pod. If more than one node can accommodate this pod, they compete for running this pod by updating .spec.nodeName to itself. And only the first accommodation request will succeed, the other ones failed due to APIResourceConflict errors.

BTW: to avoid compute overhead on every potential node, the suggested number should be configured, such as defaulting to 5.

Looks neat! However, how should the competition be regulated? IOW, how does this model interact with configured replicas? Let's say I run a deployment which wants exactly one of my pods; I think it may very well happen in this scenario that, say, three nodes try to run my pod, only one wins (obviously) and keeps the pod going, and the losers notice and silently kill their pods. Is my understanding right?

@Huang-Wei
Member

Re @AlexeyPerevalov:

  1. scheduler doesn't add in topology-related computation; instead, it suggests more than 1 node - these more than 1 node is nominated nodes, so the way we found it is based just on pkg/scheduler/framework/plugins/noderesources/fit.go

Yes, as well as other existing Filtering constraints.

  1. notify nominated nodes - does it mean we call kubelet admit handler from kube-scheduler or it's yet another new (additional and optional) stage of scheduling process?

Nope, there is still no interaction between kube-scheduler and kubelet. Here is how it works: today only one kubelet "claims" the pod (with .spec.nodeName set by the scheduler), so it's likely that the admit handler may fail it. However, with this approach multiple kubelets "pre-claim" the pod, and hopefully at least one can admit it; several kubelets then compete to "claim" the pod.

Re @fromanirh, I understand both options 2 & 3 are very rough ideas, but it would be good to brainstorm here.

Looks neat! However, how should the competition be regulated? IOW, how this model interacts with configured replicas? Let's say I run a deployment which wants exactly one of my pod, I think it may very well happen in this scenario that, say, three nodes try to run my pod, only one wins (obviously) and keep the pod going, and the losers notice and silently kill their pods. Is my understanding right?

Not really. There is always one pod for each replica. It's just that the logic changes from "one kubelet admits and runs the pod" to "N kubelets admit and compete to run the pod". So there is no concept of "kill their pod", as there is only one pod. Only the one winning kubelet will be able to change the pod's ".spec.nodeName" to its node name, that's it.

@ahg-g
Member

ahg-g commented Jun 19, 2020

The problem with providing multiple node nominations is that it still doesn't guarantee that a node will admit the pod; moreover, we get into a tuning issue where we need to select how many nodes to nominate, which will likely differ by cluster size, current cluster utilization, etc.

@AlexeyPerevalov
Contributor Author

  1. notify nominated nodes - does it mean we call kubelet admit handler from kube-scheduler or it's yet another new (additional and optional) stage of scheduling process?

Nope, there is still no interaction between kube-scheduler and kuebelet. Here is how it works: now only one kubelet "claims" the pod (with .spec.nodName set by scheduler), so it's likely that admit handler may fail it. However, if multiple kubelets "pre-claim" the pod, and hopefully at least one can admit the pod, or several kubelets competes in "claim"ing the pod.

It looks like a new stage of scheduling. kube-scheduler says these nodes are nominated, and the nodes run their admit handlers; if it passes, the kubelet on the node changes e.g. the pod's status to ReadyToRunOnTheNode or AdmitHandlerPassed, and kube-scheduler tracks this and changes the pod's spec.nodeName to the appropriate node.

But such behavior is necessary only for nodes with the TopologyManager enabled, and only for such nodes will scheduling take a little bit longer.

@AlexeyPerevalov
Contributor Author

The problem with providing multiple node nominations is that it still doesn't guarantee that a node will admit the pod, moreover we get into a tuning issue where we need to select how many nodes to nominate, which will likely differ by cluster size, current cluster utilization etc.

I think the maximum number of nominated nodes could be defined in configuration or, yes, evaluated dynamically.

@AlexeyPerevalov AlexeyPerevalov changed the title Deducted version of topology manager in kube-scheduler Simplified version of topology manager in kube-scheduler Jun 21, 2020
@AlexeyPerevalov AlexeyPerevalov marked this pull request as ready for review November 6, 2020 08:12
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 6, 2020
swatisehgal added a commit to swatisehgal/topology-aware-scheduler-plugin that referenced this pull request Nov 20, 2020
As part of the enablement of Topology-aware scheduling in kubernetes,
the scheduler plugin has been proposed as in-tree.

KEP: kubernetes/enhancements#1858
PR: kubernetes/kubernetes#90708

To enable faster development velocity, testing and community adoption
we are also packaging as an out of tree scheduler plugin. This out-of-tree
implementation is based on the above PR and KEP.

@AlexeyPerevalov
Contributor Author

We decided to focus on the out-of-tree plugin kubernetes-sigs/scheduler-plugins#119. This KEP will be postponed until noderesourcetopology-api is in staging or built-in.
/hold
Or maybe close (and reopen when the use case of private clouds on bare-metal on-premise is more widespread).

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 25, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 25, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 25, 2021
@AlexeyPerevalov
Contributor Author

/close
in favor of kubernetes-sigs/scheduler-plugins#119

@k8s-ci-robot
Contributor

@AlexeyPerevalov: Closed this PR.

In response to this:

/close
in favor of kubernetes-sigs/scheduler-plugins#119

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

swatisehgal added a commit to swatisehgal/enhancements that referenced this pull request Jun 10, 2021
- This KEP consolidates the following two KEPs into one
  - kubernetes#1858
  - kubernetes#1870

- Also the KEP talks about introducing NodeResourceTopology
   as a native Kubernetes resource.

Co-authored-by: Alexey Perevalov <alexey.perevalov@huawei.com>
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>

Successfully merging this pull request may close these issues.

scheduler being topology-unaware can cause runaway pod creation