Topology awareness in Kube-scheduler #2787
Conversation
/cc @alculquicondor
The kube-scheduler plugin will be moved from the out-of-tree kubernetes-sigs/scheduler-plugins repository into the main tree as a built-in plugin. This plugin implements a simplified version of Topology Manager and is therefore different from the original Topology Manager algorithm. The plugin would be disabled by default; when enabled, it would check the ability to run a pod only for the single-numa-node policy on the node. Since that is the strictest policy, a pod that passes the single-numa-node check for a worker node would also launch successfully under any of the other existing policies.
what if the node is not under this policy? It sounds like it could lead to underutilization.
It is a sys-admin decision to configure kubelet with a Topology Manager policy. Not all workloads require strict NUMA alignment, and therefore the Topology Manager policy could be configured to be none or restricted. More information on this is here. The default policy is none, where Topology Manager does not perform any topology alignment. But for a workload that requires NUMA alignment, running on a node where the single-numa-node policy is not configured would lead to underperformance of the workload.
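For concreteness, a minimal sketch of where this policy is configured, assuming the `k8s.io/kubelet/config/v1beta1` types; this is illustrative only and not part of this KEP's proposed changes.

```go
package example

import kubeletconfig "k8s.io/kubelet/config/v1beta1"

// topologyAwareKubeletConfig sketches the kubelet configuration knobs that
// control Topology Manager behaviour. Cluster operators normally set these
// in the kubelet config file rather than in code.
func topologyAwareKubeletConfig() *kubeletconfig.KubeletConfiguration {
	cfg := &kubeletconfig.KubeletConfiguration{}
	// Topology Manager is gated by a feature gate in older releases.
	cfg.FeatureGates = map[string]bool{"TopologyManager": true}
	// Valid policies: "none" (default), "best-effort", "restricted",
	// "single-numa-node" (the strictest, assumed by the proposed plugin).
	cfg.TopologyManagerPolicy = "single-numa-node"
	return cfg
}
```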
- Make scheduling process more precise when we have NUMA topology on the worker node.
- Enhance the node object to capture topology information which can be referred to
It looks like you are adding a new resource, not changing the existing Node object. Not saying that you should. Just update this point to reflect the actual proposal.
I would also argue that this is not a goal in and of itself; it is a tool to achieve the goal, which is NUMA-aware scheduling.
To highlight that we are adding a new resource, I would explicitly add a paragraph at the end of the summary section to describe the changes/artifacts:
- A new scheduler plugin that makes topology-aware placement decisions
- A new resource object, NodeResourceTopology, to communicate NUMA status between kubelet and kube-scheduler
- kubelet changes to populate NodeResourceTopology
Capacity string `json:"capacity"`
}

type ZoneList []Zone
nit: I don't see the need for these aliases, just use the slices directly.
Sure, will remove them
metav1.ObjectMeta `json:"metadata,omitempty"`

TopologyPolicy []string `json:"topologyPolicies"`
Zones ZoneMap `json:"zones"`
ZoneMap is not defined
That was a mistake, will fix
Name string `json:"name"`
Type string `json:"type"`
Parent string `json:"parent,omitempty"`
Costs CostList `json:"costs,omitempty"`
Is this what is currently used?
Yes, we do. You can find the current API we use here: https://github.com/k8stopologyawareschedwg/noderesourcetopology-api/blob/master/pkg/apis/topology/v1alpha1/types.go
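For readers of this thread, a consolidated sketch of those v1alpha1 types as they appear in the linked repository; the field sets may have drifted since, so treat the repository as authoritative.

```go
// Sketch of the NodeResourceTopology API published in
// k8stopologyawareschedwg/noderesourcetopology-api (v1alpha1).
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// NodeResourceTopology describes the NUMA resources of one worker node;
// the object name matches the node name.
type NodeResourceTopology struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	TopologyPolicies []string `json:"topologyPolicies"`
	Zones            ZoneList `json:"zones"`
}

// Zone is one topology unit, typically a NUMA cell.
type Zone struct {
	Name       string           `json:"name"`
	Type       string           `json:"type"`
	Parent     string           `json:"parent,omitempty"`
	Costs      CostList         `json:"costs,omitempty"`
	Attributes AttributeList    `json:"attributes,omitempty"`
	Resources  ResourceInfoList `json:"resources,omitempty"`
}

// ResourceInfo reports per-zone capacity and allocatable for one resource.
type ResourceInfo struct {
	Name        string `json:"name"`
	Allocatable string `json:"allocatable"`
	Capacity    string `json:"capacity"`
}

// CostInfo is the relative distance from one zone to another (cf. NUMA distances).
type CostInfo struct {
	Name  string `json:"name"`
	Value int    `json:"value"`
}

// AttributeInfo is a free-form key/value property of a zone.
type AttributeInfo struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

type (
	ZoneList         []Zone
	ResourceInfoList []ResourceInfo
	CostList         []CostInfo
	AttributeList    []AttributeInfo
)
```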
`serviceAccountName: noderesourcetopology-account` would have to be added to the scheduler deployment manifest.

# Use cases
Please follow the template: this should be within the Proposal section.
Once the information is captured in the NodeResourceTopology API, the scheduler can refer to it like it refers to Node Capacity and Allocatable while making a topology-aware scheduling decision.
Add `# Design details`.
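To illustrate how a scheduler plugin might "refer to it like Node Capacity and Allocatable", here is a minimal sketch of a Filter extension point. Only the Filter signature comes from the kube-scheduler framework; the lister, the policy names and the fit check are placeholders for whatever the real plugin implements.

```go
// Minimal sketch of a topology-aware Filter plugin.
package noderesourcetopology

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// zone is a simplified per-NUMA-cell view derived from NodeResourceTopology.
type zone struct {
	Name        string
	Allocatable v1.ResourceList
}

// topologyLister abstracts however the plugin would read per-node topology
// objects (for example, an informer on the new resource). It is hypothetical.
type topologyLister interface {
	Get(nodeName string) (policies []string, zones []zone, err error)
}

// Match is the sketched plugin.
type Match struct{ lister topologyLister }

func (m *Match) Name() string { return "NodeResourceTopologyMatch" }

// Filter marks a node Unschedulable when one of the pod's containers cannot be
// served from a single NUMA cell and the node reports a single-numa-node policy.
func (m *Match) Filter(ctx context.Context, _ *framework.CycleState,
	pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	policies, zones, err := m.lister.Get(nodeInfo.Node().Name)
	if err != nil || len(zones) == 0 || !singleNUMANode(policies) {
		// No topology data or no strict policy: defer to the other plugins.
		return nil
	}
	for _, c := range pod.Spec.Containers {
		if !fitsOneZone(c.Resources.Requests, zones) {
			return framework.NewStatus(framework.Unschedulable,
				"cannot align container resources to a single NUMA cell")
		}
	}
	return nil
}

func singleNUMANode(policies []string) bool {
	for _, p := range policies {
		if p == "SingleNUMANodeContainerLevel" || p == "SingleNUMANodePodLevel" {
			return true
		}
	}
	return false
}

// fitsOneZone reports whether some single zone can hold every requested quantity.
func fitsOneZone(requests v1.ResourceList, zones []zone) bool {
	for _, z := range zones {
		fits := true
		for name, want := range requests {
			have, ok := z.Allocatable[name]
			if !ok || have.Cmp(want) < 0 {
				fits = false
				break
			}
		}
		if fits {
			return true
		}
	}
	return false
}
```

The policy strings follow the names discussed later in this thread; the real plugin may encode the policy differently.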
A number of Kubernetes worker nodes run on bare metal with NUMA topology, with the TopologyManager feature gate enabled on the nodes. In this configuration, the operator does not want pods to be re-scheduled when the host topology turns out to be unsatisfactory; the operator wants scheduling to succeed the first time.

# Known limitations
Also part of the Proposal, as either Notes/Constraints/Caveats or Risks and Mitigations.
# Known limitations

Kube-scheduler makes an assumption about current resource usage on the worker node, since it knows which pods are assigned to which nodes; this assumption is made right after kube-scheduler chooses a node. But in the case of scheduling with NUMA topology, only the Topology Manager on the worker node knows the exact NUMA cells used by a pod, and this information reaches kube-scheduler with some latency. Kube-scheduler will therefore not know the actual NUMA topology until the topology exporter sends it back. This could be mitigated if, in the proposed plugin, kube-scheduler added a hint about which NUMA id the pod could be assigned to, which the Topology Manager on the worker node could then take into account.
This paragraph needs some work.
Maybe you can put it as a numbered list to describe the sequence of events.
The main point is that we could have a similar situation of pods being scheduled to nodes where they don't fit, due to latency of the topology manager (part of the kubelet) reporting.
Is the mitigation something that you plan to do in alpha or beta?
Yeah, this does need to be reworded. Will do that.
We can certainly work on enabling some sort of communication between the Topology aware scheduler and the Kubelet to take the Numa node identified by the scheduler into consideration while aligning resources in alpha or beta.
* Alpha (v1.23)

The following changes are required:
- [ ] Introducing Topology information as part of the Node API
Also: population by kubelet.
Please use the canonical template: https://github.com/kubernetes/enhancements/tree/master/keps/NNNN-kep-template
In order to address this issue, the scheduler needs to choose a node considering resource availability along with the underlying resource topology and the Topology Manager policy on the worker node.

This document describes behaviour of the Kubernetes Scheduler which takes worker node topology into account.
Suggested change:
- This document describes behaviour of the Kubernetes Scheduler which takes worker node topology into account.
+ This enhancement proposes changes to make kube-scheduler aware of node NUMA topology when making scheduling decisions.
- Make scheduling process more precise when we have NUMA topology on the worker node.
- Enhance the node object to capture topology information which can be referred to
I would also argue that this is not a goal in and of itself; it is a tool to achieve the goal, which is NUMA-aware scheduling.
## Non-Goals

- Change the PodSpec to allow requesting a specific node topology manager policy
- This Proposal requires exposing NUMA topology information. This KEP doesn't
isn't that information already known by topology manager, which is part of kubelet?
- Make scheduling process more precise when we have NUMA topology on the worker node.
- Enhance the node object to capture topology information which can be referred to
To highlight that we are adding a new resource, I would explicitly add a paragraph at the end of the summary section to describe the changes/artifacts:
- A new scheduler plugin that makes topology-aware placement decisions
- A new resource object, NodeResourceTopology, to communicate NUMA status between kubelet and kube-scheduler
- kubelet changes to populate NodeResourceTopology
plugin into the main tree as a built-in plugin. This plugin implements a simplified version of Topology Manager and is therefore different from the original Topology Manager algorithm. The plugin would be disabled by default and, when enabled, would check the ability to run a pod only for the single-numa-node policy on the node; since that is the strictest policy, a pod that passes the single-numa-node check for the worker node would also launch successfully under the other existing policies.

To work, this plugin requires topology information of the available resource on the worker nodes.
Suggested change:
- To work, this plugin requires topology information of the available resource on the worker nodes.
+ To work, this plugin requires topology information of the available resources for each NUMA cell on worker nodes.
The current policies of TopologyManager cannot coexist at the same time, but in the future such policies could appear. For example, we could have a policy for HyperThreading, and it could live alongside NUMA policies.

To use these policy names both in kube-scheduler and in kubelet, the string constants of these labels should be moved from pkg/kubelet/cm/topologymanager/ and pkg/kubelet/apis/config/types.go to a single place, pkg/apis/core/types.go.
this is not necessary to discuss in the KEP; let's focus on the enhancement itself, its semantics and API.
Okay, will remove this.
## Plugin implementation details

### Description of the Algorithm
this section should discuss the semantics of the feature, the dependencies, the flow of information and any potential race conditions.
A description of how things will work and status updates, starting from node/scheduler startup, scheduling a pod, scheduling another pod before the first one gets picked up by kubelet, deleting a pod, etc. What changes do all of these cause to the state in kubelet, api-server and scheduler, if any?
The algorithm which implements the single-numa-node policy is the following (Go listing elided in this excerpt):
yes, please avoid code, see also comment above.
# Alternative Solution
in this section, I would describe the alternative proposal, pros/cons compared to primary one and why the primary one is better. Also, this is not proposing an end-to-end alternative, only the kubelet part, if so please make that clear as well.
The proto details mentioned in the following section are not really helpful in evaluating that.
can you discuss the following potential end-to-end alternative: 1:1 worker pod to node assignment, so apart from kubelet and daemonsets, the pod will take the whole node and the application is responsible for forking processes and assign them to NUMA cells. How feasible is this, do we have frameworks/libraries that allows applications to do that (e.g., do MPI libraries enable this)?
> in this section, I would describe the alternative proposal, pros/cons compared to primary one and why the primary one is better. Also, this is not proposing an end-to-end alternative, only the kubelet part, if so please make that clear as well.

I will fix this and make this more coherent.

> can you discuss the following potential end-to-end alternative: 1:1 worker pod to node assignment, so apart from kubelet and daemonsets, the pod will take the whole node and the application is responsible for forking processes and assign them to NUMA cells. How feasible is this, do we have frameworks/libraries that allows applications to do that (e.g., do MPI libraries enable this)?

This is a very interesting point! Topology Manager can deal with the alignment of resources at container and pod scope. So if we have a pod with multiple containers, a process running inside the containers can be assigned to a NUMA cell, but I don't think we would be able to be more granular than that (assign processes inside a container to separate NUMA nodes) without making major changes to Kubelet.
The daemon, which runs outside of the kubelet, will collect all necessary information on running pods; based on the allocatable resources of the node and the resources consumed by pods, it will provide the available resources in a CRD, where one CRD instance represents one worker node. The name of the CRD instance is the name of the worker node.

## CRD API
Is it still a CRD even if we plan to host it in core?
This was part of the alternative solution section
Force-pushed from d78bc60 to ca07ed5.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: swatisehgal. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
This KEP consolidates the following two KEPs into one: kubernetes#1858, kubernetes#1870. Also the KEP talks about introducing NodeResourceTopology as a native Kubernetes resource.
Co-authored-by: Alexey Perevalov <alexey.perevalov@huawei.com>
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Force-pushed from 8bcbf30 to abdc1bf.
Force-pushed from abdc1bf to 25b4cef.
- "@huang-wei" | ||
- "@derekwaynecarr" | ||
- "@mrunalp" | ||
- "@rphilips" |
rphillips
Thanks, will fix now.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
1. If requested resource cannot be found on a node, it is unset as available NUMA cell
1. If an unknown resource has 0 quantity, the NUMA cell should be left set.
* The following checks are performed:
1. Add NUMA cell to the resourceBitmask if resource is cpu and it's not guaranteed QoS, since cpu will flow
Do constants within the API for 'cpu' and 'memory' make sense?
... perhaps others?
"cpu", "memory" and other resources are not stored as constants but rather stored in the ResourceInfo data structure below where the name corresponds the resource name stored here as value which is obtained from the v1.ResourceCPU or v1.ResourceMemory and in the scheduler plugin a resource is compared with values v1.ResourceCPU or v1.ResourceMemory to determine if the resource is a CPU or memory.
type ResourceInfo struct {
Name string `json:"name"`
Allocatable string `json:"allocatable"`
Capacity string `json:"capacity"`
}
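A tiny sketch of that comparison (the helper name is hypothetical):

```go
package example

import v1 "k8s.io/api/core/v1"

// resourceKind classifies a ResourceInfo.Name the way described above: by
// comparing it against the well-known v1.ResourceCPU / v1.ResourceMemory values.
func resourceKind(name string) string {
	switch v1.ResourceName(name) {
	case v1.ResourceCPU:
		return "cpu"
	case v1.ResourceMemory:
		return "memory"
	default:
		return "device-or-extended"
	}
}
```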
Force-pushed from 25b4cef to 586c982.
Name string `json:"name"`
Value string `json:"value"`
}
// Kubelet writes to NodeResourceTopology
@rphillips Here I have provided a real-world example of how these fields would look once populated. I thought it was best to do that after all the data structures have been defined. Would you prefer for it to be moved before the data structures are defined, or maybe to a different section?
- "@ehashman" | ||
- "@klueska" | ||
approvers: | ||
- "@sig-scheduling-leads" |
needs approver from sig-node as well.
#prr-approvers:

see-also:
- "https://github.com/kubernetes/enhancements/pull/1870"
These should be links to related merged KEPs. Example: /keps/sig-node/693-topology-manager
### Risks and Mitigations

Topology Manager on the worker node knows exact resources and their NUMA node allocated to pods but the and node resource
it looks like there is some typo here.
actual NUMA topology until the information of the available resources at a NUMA node level is evaluated in the kubelet which could still lead to scheduling of pods to nodes where they won't be admitted by Topology Manager.

This can be mitigated if kube-scheduler provides a hint of which NUMA ID a pod should be assigned and Topology Manager on the
Elaborate. How is the NUMA ID used in follow-up scheduling cycles?
Costs CostList `json:"costs,omitempty"`
Attributes AttributeList `json:"attributes,omitempty"`
Resources ResourceInfoList `json:"resources,omitempty"`
}
better also provide the comments
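One possible commented form of these fields, based on a reading of the linked v1alpha1 API rather than on text in this KEP:

```go
// Zone describes one topology unit of a node, typically a NUMA cell.
type Zone struct {
	Name   string `json:"name"`             // zone identifier, e.g. "node-0"
	Type   string `json:"type"`             // zone kind, e.g. "Node" for a NUMA cell
	Parent string `json:"parent,omitempty"` // enclosing zone, if zones are nested

	// Costs lists the relative distance from this zone to the other zones,
	// analogous to the NUMA distance table exposed by the kernel.
	Costs CostList `json:"costs,omitempty"`

	// Attributes carries free-form key/value properties of the zone.
	Attributes AttributeList `json:"attributes,omitempty"`

	// Resources reports per-zone capacity and allocatable quantities.
	Resources ResourceInfoList `json:"resources,omitempty"`
}
```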
1. At the filter extension point of the plugin, the QoS class of the pod is determined; in case it is a best effort pod or the Topology Manager Policy configured on the node is not single-numa-node, the node is not considered for scheduling
1. The Topology Manager Scope is determined.
Can you elaborate what this is?
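For context, the kubelet Topology Manager supports a `container` scope and a `pod` scope (the `topologyManagerScope` kubelet configuration field): the scope decides whether alignment is evaluated per container or for the pod as a whole. A minimal sketch of how a scope-dependent check might group the requests to align; the helper name and grouping are illustrative only.

```go
package example

import v1 "k8s.io/api/core/v1"

// requestsToAlign returns the resource sets that must each fit a single NUMA
// cell, depending on the Topology Manager scope. With the "pod" scope the
// whole pod is one unit; with the "container" scope (the default) each
// container is checked on its own.
func requestsToAlign(pod *v1.Pod, scope string) []v1.ResourceList {
	if scope == "pod" {
		total := v1.ResourceList{}
		for _, c := range pod.Spec.Containers {
			for name, q := range c.Resources.Requests {
				sum := total[name]
				sum.Add(q)
				total[name] = sum
			}
		}
		return []v1.ResourceList{total}
	}
	out := make([]v1.ResourceList, 0, len(pod.Spec.Containers))
	for _, c := range pod.Spec.Containers {
		out = append(out, c.Resources.Requests)
	}
	return out
}
```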
- [ ] Implementation of Score extension point
- [ ] Add node E2E tests.
- [ ] Provide beta-level documentation.
Documentation should be part of alpha too
## Design Details

- add a new flag in Kubelet called `ExposeNodeResourceTopology` in the kubelet config or command line argument called `expose-noderesourcetopology` which allows
Take a decision here: are you proposing a permanent command line flag or not? Note that this is different from the feature gate, which should disappear in a few releases.
CRI or CNI may require updating that component before the kubelet.
-->

Feature flag will apply to kubelet only, so version skew strategy is N/A.
what happens in the kubelet when you disable the flag? Does the NodeResourceTopology object get deleted? If I downgrade a kubelet, note that the object might stay there.
- [X] Enable scheduler plugin `NodeResourceTopologyMatch` in the KubeScheduler config
- Describe the mechanism:
  - Will enabling / disabling the feature require downtime of the control plane? Yes, Feature gate must be set on kubelet start. To disable, kubelet must be
kube-scheduler
// Kubelet writes to NodeResourceTopology
// and scheduler plugin reads from it
// Real world example of how these fields are populated is as follows:
// Cells:
Cells is a map. Can you add the key of the Cells map in this example?
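For illustration only, a hedged example of how a populated object might look for a node with two NUMA cells, continuing the types sketched earlier in this thread. All names and quantities are hypothetical; if Cells/Zones is modelled as a map, zone names such as "node-0" would double as the map keys.

```go
// Hypothetical populated NodeResourceTopology for a two-cell node.
var example = NodeResourceTopology{
	ObjectMeta:       metav1.ObjectMeta{Name: "worker-1"}, // matches the node name
	TopologyPolicies: []string{"SingleNUMANodeContainerLevel"},
	Zones: ZoneList{
		{
			Name: "node-0",
			Type: "Node",
			Resources: ResourceInfoList{
				{Name: "cpu", Allocatable: "20", Capacity: "24"},
				{Name: "memory", Allocatable: "60Gi", Capacity: "64Gi"},
				{Name: "example.com/device", Allocatable: "1", Capacity: "1"},
			},
			Costs: CostList{{Name: "node-0", Value: 10}, {Name: "node-1", Value: 20}},
		},
		{
			Name: "node-1",
			Type: "Node",
			Resources: ResourceInfoList{
				{Name: "cpu", Allocatable: "22", Capacity: "24"},
				{Name: "memory", Allocatable: "62Gi", Capacity: "64Gi"},
			},
			Costs: CostList{{Name: "node-0", Value: 20}, {Name: "node-1", Value: 10}},
		},
	},
}
```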
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`

TopologyPolicies []string `json:"topologyPolicies"`
Can you list the policies and explain them? For example: SingleNUMANodeContainerLevel, SingleNUMANodePodLevel?
Costs CostList `json:"costs,omitempty"`
Attributes AttributeList `json:"attributes,omitempty"`
Resources ResourceInfoList `json:"resources,omitempty"`
}
+1 for comments.
Just by reading the examples, I still don't understand what costs is.
1. Add NUMA cell to the resourceBitmask if resource is memory and it's not guaranteed QoS, since memory will flow
1. Add NUMA cell to the resourceBitmask if zero quantity for non existing resource was requested
1. otherwise check amount of resources
* Once the resourceBitMark is determined it is ANDed with the cummulative bitmask
cumulative ?
1. otherwise check amount of resources
* Once the resourceBitMark is determined it is ANDed with the cummulative bitmask
4. If resources cannot be aligned from the same NUMA cell for a container, alignment cannot be achieved for the entire pod and the resource cannot be aligned in case of the pod under consideration. Such a pod is returned with a Status Unschedulable
If we return Unschedulable, it will enter the PostFilter phase. But we don't know which cell the pod is assigned to, so DeletePod in PostFilter will not change the NodeResourceTopology, and PodPassesFiltersOnNode will always return false. Then this pod can never choose one or more pods to preempt.
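To make the bitmask step quoted above concrete, a minimal sketch using a plain integer bitmask (one bit per NUMA cell); the per-resource selection rules are simplified relative to the list in the diff.

```go
package example

// fitsSingleCell reports whether at least one NUMA cell can satisfy every
// per-resource requirement. Each entry of perResourceCells is the bitmask of
// cells able to provide one requested resource (bit i set == cell i can).
// The cumulative mask is the AND of all per-resource masks; an empty result
// means the pod cannot be aligned and would be returned as Unschedulable.
func fitsSingleCell(perResourceCells []uint64) bool {
	cumulative := ^uint64(0) // start with every cell allowed
	for _, mask := range perResourceCells {
		cumulative &= mask
	}
	return cumulative != 0
}
```

For example, if CPU is available on cells 0 and 1 (0b11) but a requested device exists only on cell 1 (0b10), the AND is 0b10 and the pod can still be aligned on cell 1.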
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
/lifecycle stale
/lifecycle rotten
/remove-lifecycle rotten
/lifecycle stale
/remove-lifecycle stale
Is this topic still in progress? We hope to help the community continue to promote this work.
The first version of
/lifecycle stale
/lifecycle rotten
/close
@k8s-triage-robot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This KEP consolidates the following two KEPs into one:
- kubernetes#1858
- kubernetes#1870

Also the KEP talks about introducing NodeResourceTopology as a native Kubernetes resource.

Fixes: kubernetes/kubernetes#84869