This repository has been archived by the owner on May 25, 2023. It is now read-only.

Add Unschedulable PodCondition for pods in pending #535

Merged: 2 commits merged into kubernetes-retired:master on Jan 13, 2019

Conversation

Jeffwan
Contributor

@Jeffwan Jeffwan commented Jan 4, 2019

What this PR does / why we need it:
Please check #526 for details. Pods scheduled by kube-batch don't have rich pod conditions, which makes it hard for other components, like the cluster autoscaler, to interact with them.

This PR adds PodConditions for pending pods so that other Kubernetes components can be aware of them.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #526

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2019-01-04T01:49:02Z
    message: 0/2 nodes are available, 2 insufficient cpu.
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable
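
The `PodScheduled==False` condition in the status above follows the usual merge semantics for pod conditions: replace an existing condition of the same type in place, or append a new one, and only move the transition time when the status actually changes. A minimal, self-contained Go sketch of that merge logic (using simplified local types rather than the real `k8s.io/api` structs) might look like this:

```go
package main

import (
	"fmt"
	"time"
)

// PodCondition is a simplified stand-in for v1.PodCondition.
type PodCondition struct {
	Type               string // e.g. "PodScheduled"
	Status             string // "True" / "False" / "Unknown"
	Reason             string // e.g. "Unschedulable"
	Message            string
	LastTransitionTime time.Time
}

// updatePodCondition replaces a condition of the same type in place, or
// appends it; the transition time is preserved when Status is unchanged.
func updatePodCondition(conds []PodCondition, c PodCondition) []PodCondition {
	for i, old := range conds {
		if old.Type == c.Type {
			if old.Status == c.Status {
				c.LastTransitionTime = old.LastTransitionTime
			}
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

func main() {
	conds := updatePodCondition(nil, PodCondition{
		Type:    "PodScheduled",
		Status:  "False",
		Reason:  "Unschedulable",
		Message: "0/2 nodes are available, 2 insufficient cpu.",
	})
	fmt.Println(len(conds), conds[0].Reason) // 1 Unschedulable
}
```

In the real PR the update goes through the clientset's `UpdateStatus` call against the API server; the sketch only illustrates the in-memory merge.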

Special notes for your reviewer:
I tried the following two alternatives, but both of them fail to create new nodes. Currently, CA detects pending pods but skips the scale-up action after simulation. This is not accurate because it only considers resources by placing pods one by one, not all the pending pods together. (Since the CA project is designed for the default scheduler, I am not sure whether they would accept a change there.)

Logs from CA

I0104 01:49:44.045065       1 utils.go:142] Pod qj-1-nd485 marked as unschedulable can be scheduled on ip-192-168-46-32.us-west-2.compute.internal. Ignoring in scale up.
I0104 01:49:44.045162       1 utils.go:128] Pod qj-1-8djdt marked as unschedulable can be scheduled (based on simulation run for other pod owned by the same controller). Ignoring in scale up.
I0104 01:49:44.045245       1 utils.go:128] Pod qj-1-rdvph marked as unschedulable can be scheduled (based on simulation run for other pod owned by the same controller). Ignoring in scale up.
I0104 01:49:44.045327       1 utils.go:128] Pod qj-1-6px6z marked as unschedulable can be scheduled (based on simulation run for other pod owned by the same controller). Ignoring in scale up.

Logs from kube-batch (2 pods allocated and 2 pending)

I0103 17:50:16.825058   29996 allocate.go:42] Enter Allocate ...
I0103 17:50:16.825068   29996 allocate.go:57] Added Job <default/qj-1> into Queue <default>
I0103 17:50:16.825077   29996 allocate.go:61] Try to allocate resource to 1 Queues

I0103 17:50:16.825085   29996 allocate.go:78] Try to allocate resource to Jobs in Queue <default>
I0103 17:50:16.825101   29996 allocate.go:102] Try to allocate resource to 4 tasks of Job <default/qj-1>
I0103 17:50:16.825111   29996 allocate.go:109] There are <2> nodes for Job <default/qj-1>
I0103 17:50:16.825119   29996 allocate.go:120] Considering Task <default/qj-1-nd485> on node <ip-192-168-46-32.us-west-2.compute.internal>: <cpu 2000.00, memory 0.00, GPU 0.00> vs. <cpu 3690.00, memory 16568913920.00, GPU 0.00>
I0103 17:50:16.825159   29996 allocate.go:132] Binding Task <default/qj-1-nd485> to node <ip-192-168-46-32.us-west-2.compute.internal>
I0103 17:50:16.825174   29996 session.go:170] After allocated Task <default/qj-1-nd485> to Node <ip-192-168-46-32.us-west-2.compute.internal>: idle <cpu 1690.00, memory 16568913920.00, GPU 0.00>, used <cpu 2310.00, memory 146800640.00, GPU 0.00>, releasing <cpu 0.00, memory 0.00, GPU 0.00>

I0103 17:50:16.825196   29996 allocate.go:78] Try to allocate resource to Jobs in Queue <default>
I0103 17:50:16.825205   29996 allocate.go:102] Try to allocate resource to 3 tasks of Job <default/qj-1>
I0103 17:50:16.825214   29996 allocate.go:109] There are <2> nodes for Job <default/qj-1>
I0103 17:50:16.825222   29996 allocate.go:120] Considering Task <default/qj-1-8djdt> on node <ip-192-168-46-32.us-west-2.compute.internal>: <cpu 2000.00, memory 0.00, GPU 0.00> vs. <cpu 1690.00, memory 16568913920.00, GPU 0.00>
I0103 17:50:16.825264   29996 allocate.go:120] Considering Task <default/qj-1-8djdt> on node <ip-192-168-71-35.us-west-2.compute.internal>: <cpu 2000.00, memory 0.00, GPU 0.00> vs. <cpu 3890.00, memory 16715714560.00, GPU 0.00>
I0103 17:50:16.825296   29996 allocate.go:132] Binding Task <default/qj-1-8djdt> to node <ip-192-168-71-35.us-west-2.compute.internal>
I0103 17:50:16.825311   29996 session.go:170] After allocated Task <default/qj-1-8djdt> to Node <ip-192-168-71-35.us-west-2.compute.internal>: idle <cpu 1890.00, memory 16715714560.00, GPU 0.00>, used <cpu 2110.00, memory 0.00, GPU 0.00>, releasing <cpu 0.00, memory 0.00, GPU 0.00>

I0103 17:50:16.825327   29996 allocate.go:78] Try to allocate resource to Jobs in Queue <default>
I0103 17:50:16.825339   29996 allocate.go:102] Try to allocate resource to 2 tasks of Job <default/qj-1>
I0103 17:50:16.825348   29996 allocate.go:109] There are <2> nodes for Job <default/qj-1>
I0103 17:50:16.825355   29996 allocate.go:120] Considering Task <default/qj-1-rdvph> on node <ip-192-168-46-32.us-west-2.compute.internal>: <cpu 2000.00, memory 0.00, GPU 0.00> vs. <cpu 1690.00, memory 16568913920.00, GPU 0.00>
I0103 17:50:16.825398   29996 allocate.go:120] Considering Task <default/qj-1-rdvph> on node <ip-192-168-71-35.us-west-2.compute.internal>: <cpu 2000.00, memory 0.00, GPU 0.00> vs. <cpu 1890.00, memory 16715714560.00, GPU 0.00>
I0103 17:50:16.825442   29996 allocate.go:78] Try to allocate resource to Jobs in Queue <default>
I0103 17:50:16.825450   29996 allocate.go:81] Can not find jobs for queue default.
I0103 17:50:16.825460   29996 allocate.go:173] Leaving Allocate ...
I0103 17:50:16.825470   29996 backfill.go:41] Enter Backfill ...
I0103 17:50:16.825479   29996 backfill.go:71] Leaving Backfill ...
I0103 17:50:16.825485   29996 preempt.go:44] Enter Preempt ...
I0103 17:50:16.825492   29996 preempt.go:56] Added Queue <default> for Job <default/qj-1>
I0103 17:50:16.825531   29996 preempt.go:186] Considering Task <default/qj-1-rdvph> on Node <ip-192-168-46-32.us-west-2.compute.internal>.
I0103 17:50:16.825544   29996 gang.go:103] Victims from Gang plugins are []
I0103 17:50:16.825556   29996 preempt.go:200] No validated victims on Node <ip-192-168-46-32.us-west-2.compute.internal>: no victims
I0103 17:50:16.825617   29996 preempt.go:186] Considering Task <default/qj-1-rdvph> on Node <ip-192-168-71-35.us-west-2.compute.internal>.
I0103 17:50:16.825643   29996 gang.go:103] Victims from Gang plugins are []
I0103 17:50:16.825672   29996 preempt.go:200] No validated victims on Node <ip-192-168-71-35.us-west-2.compute.internal>: no victims
I0103 17:50:16.825773   29996 preempt.go:186] Considering Task <default/qj-1-6px6z> on Node <ip-192-168-46-32.us-west-2.compute.internal>.
I0103 17:50:16.825789   29996 gang.go:103] Victims from Gang plugins are []
I0103 17:50:16.825800   29996 preempt.go:200] No validated victims on Node <ip-192-168-46-32.us-west-2.compute.internal>: no victims
I0103 17:50:16.825838   29996 preempt.go:186] Considering Task <default/qj-1-6px6z> on Node <ip-192-168-71-35.us-west-2.compute.internal>.
I0103 17:50:16.825851   29996 gang.go:103] Victims from Gang plugins are []
I0103 17:50:16.825867   29996 preempt.go:200] No validated victims on Node <ip-192-168-71-35.us-west-2.compute.internal>: no victims
I0103 17:50:16.825882   29996 preempt.go:92] No preemptor task in job <default/qj-1>.
I0103 17:50:16.825893   29996 statement.go:195] Discarding operations ...
I0103 17:50:16.825906   29996 preempt.go:81] No preemptors in Queue <default>, break.
I0103 17:50:16.825919   29996 preempt.go:165] Leaving Preempt ...
I0103 17:50:16.825931   29996 gang.go:163] Gang: <default/qj-1> allocated: 2, pending: 2
I0103 17:50:16.826004   29996 session.go:307] Discard Job <default/qj-1> because 2/4 tasks in gang unschedulable: 0/2 nodes are available, 2 insufficient cpu.
I0103 17:50:16.826036   29996 pod_condition.go:23] Updating pod condition for default/qj-1-rdvph to (PodScheduled==False)
I0103 17:50:16.826070   29996 pod_condition.go:23] Updating pod condition for default/qj-1-6px6z to (PodScheduled==False)
I0103 17:50:16.826100   29996 pod_condition.go:23] Updating pod condition for default/qj-1-nd485 to (PodScheduled==False)
I0103 17:50:16.826131   29996 pod_condition.go:23] Updating pod condition for default/qj-1-8djdt to (PodScheduled==False)
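
The log lines above show the gang check: 2 of 4 tasks are allocated, the gang is not satisfied, so the whole job is discarded and every task's pod condition is updated. A rough sketch of that decision (the `minAvailable` value here is an assumption; the real logic lives in the gang plugin in `gang.go`):

```go
package main

import "fmt"

// gangReady reports whether enough tasks were allocated to satisfy the
// gang; a job that falls short is discarded rather than partially run.
func gangReady(allocated, minAvailable int) bool {
	return allocated >= minAvailable
}

func main() {
	// Values taken from the logs above: 2 allocated, 2 pending,
	// and (assumed) minAvailable equal to the 4-task job size.
	allocated, pending, minAvailable := 2, 2, 4
	fmt.Printf("Gang: <default/qj-1> allocated: %d, pending: %d\n", allocated, pending)
	if !gangReady(allocated, minAvailable) {
		fmt.Println("Discard Job <default/qj-1>: gang unschedulable")
	}
}
```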

Release note:

Add PodConditions for pending pods

@k8s-ci-robot added the cncf-cla: yes label on Jan 4, 2019
@k8s-ci-robot added the size/M label on Jan 4, 2019
@@ -51,7 +51,7 @@ func New(config *rest.Config, schedulerName string, nsAsQueue bool) Cache {
 type SchedulerCache struct {
 	sync.Mutex

-	kubeclient *kubernetes.Clientset
+	Kubeclient *kubernetes.Clientset
Contributor Author

I have to use the clientset and recorder, and I'd like to reuse them from the cache object. Let me know if this is fine.

@k8s-ci-robot added the size/XXL label and removed the size/M label on Jan 8, 2019
@Jeffwan
Contributor Author

Jeffwan commented Jan 8, 2019

@k82cn I updated the code to address the review feedback. Please let me know if any additional changes are needed.

  • Move pod related logic inside cache.go
  • Add basic UTs (add fake and testify libs)

@k82cn
Contributor

k82cn commented Jan 10, 2019

@Jeffwan, would you help to squash these into only 2 commits: major logic and vendor?

@Jeffwan Jeffwan force-pushed the pod_condition branch 2 times, most recently from 2ef97c8 to 981868c Compare January 11, 2019 06:20
@Jeffwan
Contributor Author

Jeffwan commented Jan 11, 2019

@k82cn Yeah, definitely. I rearranged the commits into two: one for the code change and the other for vendor dependencies.

@Jeffwan
Contributor Author

Jeffwan commented Jan 12, 2019

Looks like CI has updates that have not been reflected on GitHub. @k82cn, do you have any idea about this problem?

jobErrMsg := job.FitError()

// Update podCondition for tasks Allocated and Pending before job discarded
for _, taskInfo := range job.TaskStatusIndex[api.Pending] {
Contributor Author

@Jeffwan Jeffwan Jan 12, 2019

@k82cn Thanks for the careful examination. I've cleaned it up; please have a look at this revision.

Moved the task status update into Backoff; Allocated tasks are included now.
The reason I use jobErrMsg and FailedSchedulingEvent here is to keep the same message as the default scheduler, in case some components rely on it.

I'll leave ssn.TaskUnschedulable public for now, in case something outside uses it to update status.

Contributor

my pleasure :)

@Jeffwan
Contributor Author

Jeffwan commented Jan 13, 2019

Resolved vendor conflicts from #551.

@k82cn
Contributor

k82cn commented Jan 13, 2019

/lgtm
/approve

@k8s-ci-robot added the lgtm label on Jan 13, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Jeffwan, k82cn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Jan 13, 2019
@k8s-ci-robot k8s-ci-robot merged commit 2e4c9e3 into kubernetes-retired:master Jan 13, 2019
@k82cn k82cn added this to the v0.4 milestone Jan 26, 2019
kevin-wangzefeng pushed a commit to kevin-wangzefeng/scheduler that referenced this pull request Jun 28, 2019
Add Unschedulable PodCondition for pods in pending