This repository has been archived by the owner on May 25, 2023. It is now read-only.

Add Unschedulable PodCondition for pods in pending #535

Merged: 2 commits merged into kubernetes-retired:master on Jan 13, 2019

Conversation

Jeffwan
Contributor

@Jeffwan Jeffwan commented Jan 4, 2019

What this PR does / why we need it:
Please check #526 for details. Pods scheduled by kube-batch don't have rich pod conditions, which makes it hard for other components, like the cluster autoscaler, to interact with them.

This PR adds PodConditions for pending pods so that other Kubernetes components can be aware of them.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #526

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2019-01-04T01:49:02Z
    message: 0/2 nodes are available, 2 insufficient cpu.
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable
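
The `PodScheduled==False` condition in the status above follows the usual merge semantics for pod conditions: replace an existing condition of the same type in place, or append a new one, and only move the transition time when the status actually changes. A minimal, self-contained Go sketch of that merge logic (using simplified local types rather than the real `k8s.io/api` structs) might look like this:

```go
package main

import (
	"fmt"
	"time"
)

// PodCondition is a simplified stand-in for v1.PodCondition.
type PodCondition struct {
	Type               string // e.g. "PodScheduled"
	Status             string // "True" / "False" / "Unknown"
	Reason             string // e.g. "Unschedulable"
	Message            string
	LastTransitionTime time.Time
}

// updatePodCondition replaces a condition of the same type in place, or
// appends it; the transition time is preserved when Status is unchanged.
func updatePodCondition(conds []PodCondition, c PodCondition) []PodCondition {
	for i, old := range conds {
		if old.Type == c.Type {
			if old.Status == c.Status {
				c.LastTransitionTime = old.LastTransitionTime
			}
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

func main() {
	conds := updatePodCondition(nil, PodCondition{
		Type:    "PodScheduled",
		Status:  "False",
		Reason:  "Unschedulable",
		Message: "0/2 nodes are available, 2 insufficient cpu.",
	})
	fmt.Println(len(conds), conds[0].Reason) // 1 Unschedulable
}
```

In the real PR the update goes through the clientset's `UpdateStatus` call against the API server; the sketch only illustrates the in-memory merge.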

Special notes for your reviewer:
I tried the following two alternatives, but both of them fail to create new nodes. Currently, CA detects pending pods but skips the scale-up action after simulation. This is not accurate because it only considers resources by placing pods one by one, not all the pending pods together. (Since the CA project is designed for the default scheduler, I am not sure whether they would accept a change there.)

Logs from CA

I0104 01:49:44.045065       1 utils.go:142] Pod qj-1-nd485 marked as unschedulable can be scheduled on ip-192-168-46-32.us-west-2.compute.internal. Ignoring in scale up.
I0104 01:49:44.045162       1 utils.go:128] Pod qj-1-8djdt marked as unschedulable can be scheduled (based on simulation run for other pod owned by the same controller). Ignoring in scale up.
I0104 01:49:44.045245       1 utils.go:128] Pod qj-1-rdvph marked as unschedulable can be scheduled (based on simulation run for other pod owned by the same controller). Ignoring in scale up.
I0104 01:49:44.045327       1 utils.go:128] Pod qj-1-6px6z marked as unschedulable can be scheduled (based on simulation run for other pod owned by the same controller). Ignoring in scale up.

Logs from kube-batch (2 pods allocated and 2 pending)

I0103 17:50:16.825058   29996 allocate.go:42] Enter Allocate ...
I0103 17:50:16.825068   29996 allocate.go:57] Added Job <default/qj-1> into Queue <default>
I0103 17:50:16.825077   29996 allocate.go:61] Try to allocate resource to 1 Queues

I0103 17:50:16.825085   29996 allocate.go:78] Try to allocate resource to Jobs in Queue <default>
I0103 17:50:16.825101   29996 allocate.go:102] Try to allocate resource to 4 tasks of Job <default/qj-1>
I0103 17:50:16.825111   29996 allocate.go:109] There are <2> nodes for Job <default/qj-1>
I0103 17:50:16.825119   29996 allocate.go:120] Considering Task <default/qj-1-nd485> on node <ip-192-168-46-32.us-west-2.compute.internal>: <cpu 2000.00, memory 0.00, GPU 0.00> vs. <cpu 3690.00, memory 16568913920.00, GPU 0.00>
I0103 17:50:16.825159   29996 allocate.go:132] Binding Task <default/qj-1-nd485> to node <ip-192-168-46-32.us-west-2.compute.internal>
I0103 17:50:16.825174   29996 session.go:170] After allocated Task <default/qj-1-nd485> to Node <ip-192-168-46-32.us-west-2.compute.internal>: idle <cpu 1690.00, memory 16568913920.00, GPU 0.00>, used <cpu 2310.00, memory 146800640.00, GPU 0.00>, releasing <cpu 0.00, memory 0.00, GPU 0.00>

I0103 17:50:16.825196   29996 allocate.go:78] Try to allocate resource to Jobs in Queue <default>
I0103 17:50:16.825205   29996 allocate.go:102] Try to allocate resource to 3 tasks of Job <default/qj-1>
I0103 17:50:16.825214   29996 allocate.go:109] There are <2> nodes for Job <default/qj-1>
I0103 17:50:16.825222   29996 allocate.go:120] Considering Task <default/qj-1-8djdt> on node <ip-192-168-46-32.us-west-2.compute.internal>: <cpu 2000.00, memory 0.00, GPU 0.00> vs. <cpu 1690.00, memory 16568913920.00, GPU 0.00>
I0103 17:50:16.825264   29996 allocate.go:120] Considering Task <default/qj-1-8djdt> on node <ip-192-168-71-35.us-west-2.compute.internal>: <cpu 2000.00, memory 0.00, GPU 0.00> vs. <cpu 3890.00, memory 16715714560.00, GPU 0.00>
I0103 17:50:16.825296   29996 allocate.go:132] Binding Task <default/qj-1-8djdt> to node <ip-192-168-71-35.us-west-2.compute.internal>
I0103 17:50:16.825311   29996 session.go:170] After allocated Task <default/qj-1-8djdt> to Node <ip-192-168-71-35.us-west-2.compute.internal>: idle <cpu 1890.00, memory 16715714560.00, GPU 0.00>, used <cpu 2110.00, memory 0.00, GPU 0.00>, releasing <cpu 0.00, memory 0.00, GPU 0.00>

I0103 17:50:16.825327   29996 allocate.go:78] Try to allocate resource to Jobs in Queue <default>
I0103 17:50:16.825339   29996 allocate.go:102] Try to allocate resource to 2 tasks of Job <default/qj-1>
I0103 17:50:16.825348   29996 allocate.go:109] There are <2> nodes for Job <default/qj-1>
I0103 17:50:16.825355   29996 allocate.go:120] Considering Task <default/qj-1-rdvph> on node <ip-192-168-46-32.us-west-2.compute.internal>: <cpu 2000.00, memory 0.00, GPU 0.00> vs. <cpu 1690.00, memory 16568913920.00, GPU 0.00>
I0103 17:50:16.825398   29996 allocate.go:120] Considering Task <default/qj-1-rdvph> on node <ip-192-168-71-35.us-west-2.compute.internal>: <cpu 2000.00, memory 0.00, GPU 0.00> vs. <cpu 1890.00, memory 16715714560.00, GPU 0.00>
I0103 17:50:16.825442   29996 allocate.go:78] Try to allocate resource to Jobs in Queue <default>
I0103 17:50:16.825450   29996 allocate.go:81] Can not find jobs for queue default.
I0103 17:50:16.825460   29996 allocate.go:173] Leaving Allocate ...
I0103 17:50:16.825470   29996 backfill.go:41] Enter Backfill ...
I0103 17:50:16.825479   29996 backfill.go:71] Leaving Backfill ...
I0103 17:50:16.825485   29996 preempt.go:44] Enter Preempt ...
I0103 17:50:16.825492   29996 preempt.go:56] Added Queue <default> for Job <default/qj-1>
I0103 17:50:16.825531   29996 preempt.go:186] Considering Task <default/qj-1-rdvph> on Node <ip-192-168-46-32.us-west-2.compute.internal>.
I0103 17:50:16.825544   29996 gang.go:103] Victims from Gang plugins are []
I0103 17:50:16.825556   29996 preempt.go:200] No validated victims on Node <ip-192-168-46-32.us-west-2.compute.internal>: no victims
I0103 17:50:16.825617   29996 preempt.go:186] Considering Task <default/qj-1-rdvph> on Node <ip-192-168-71-35.us-west-2.compute.internal>.
I0103 17:50:16.825643   29996 gang.go:103] Victims from Gang plugins are []
I0103 17:50:16.825672   29996 preempt.go:200] No validated victims on Node <ip-192-168-71-35.us-west-2.compute.internal>: no victims
I0103 17:50:16.825773   29996 preempt.go:186] Considering Task <default/qj-1-6px6z> on Node <ip-192-168-46-32.us-west-2.compute.internal>.
I0103 17:50:16.825789   29996 gang.go:103] Victims from Gang plugins are []
I0103 17:50:16.825800   29996 preempt.go:200] No validated victims on Node <ip-192-168-46-32.us-west-2.compute.internal>: no victims
I0103 17:50:16.825838   29996 preempt.go:186] Considering Task <default/qj-1-6px6z> on Node <ip-192-168-71-35.us-west-2.compute.internal>.
I0103 17:50:16.825851   29996 gang.go:103] Victims from Gang plugins are []
I0103 17:50:16.825867   29996 preempt.go:200] No validated victims on Node <ip-192-168-71-35.us-west-2.compute.internal>: no victims
I0103 17:50:16.825882   29996 preempt.go:92] No preemptor task in job <default/qj-1>.
I0103 17:50:16.825893   29996 statement.go:195] Discarding operations ...
I0103 17:50:16.825906   29996 preempt.go:81] No preemptors in Queue <default>, break.
I0103 17:50:16.825919   29996 preempt.go:165] Leaving Preempt ...
I0103 17:50:16.825931   29996 gang.go:163] Gang: <default/qj-1> allocated: 2, pending: 2
I0103 17:50:16.826004   29996 session.go:307] Discard Job <default/qj-1> because 2/4 tasks in gang unschedulable: 0/2 nodes are available, 2 insufficient cpu.
I0103 17:50:16.826036   29996 pod_condition.go:23] Updating pod condition for default/qj-1-rdvph to (PodScheduled==False)
I0103 17:50:16.826070   29996 pod_condition.go:23] Updating pod condition for default/qj-1-6px6z to (PodScheduled==False)
I0103 17:50:16.826100   29996 pod_condition.go:23] Updating pod condition for default/qj-1-nd485 to (PodScheduled==False)
I0103 17:50:16.826131   29996 pod_condition.go:23] Updating pod condition for default/qj-1-8djdt to (PodScheduled==False)
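
The log lines above show the gang check: 2 of 4 tasks are allocated, the gang is not satisfied, so the whole job is discarded and every task's pod condition is updated. A rough sketch of that decision (the `minAvailable` value here is an assumption; the real logic lives in the gang plugin in `gang.go`):

```go
package main

import "fmt"

// gangReady reports whether enough tasks were allocated to satisfy the
// gang; a job that falls short is discarded rather than partially run.
func gangReady(allocated, minAvailable int) bool {
	return allocated >= minAvailable
}

func main() {
	// Values taken from the logs above: 2 allocated, 2 pending,
	// and (assumed) minAvailable equal to the 4-task job size.
	allocated, pending, minAvailable := 2, 2, 4
	fmt.Printf("Gang: <default/qj-1> allocated: %d, pending: %d\n", allocated, pending)
	if !gangReady(allocated, minAvailable) {
		fmt.Println("Discard Job <default/qj-1>: gang unschedulable")
	}
}
```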

Release note:

Add PodConditions for pending pods

@k8s-ci-robot added the cncf-cla: yes label on Jan 4, 2019
@k8s-ci-robot added the size/M label on Jan 4, 2019
@@ -51,7 +51,7 @@ func New(config *rest.Config, schedulerName string, nsAsQueue bool) Cache {
 type SchedulerCache struct {
 	sync.Mutex

-	kubeclient *kubernetes.Clientset
+	Kubeclient *kubernetes.Clientset
Contributor Author

I have to use the clientset and recorder, and I'd like to reuse them from the cache object. Let me know if this is fine.

@k8s-ci-robot added the size/XXL label and removed the size/M label on Jan 8, 2019
@Jeffwan
Contributor Author

Jeffwan commented Jan 8, 2019

@k82cn I updated the code to address the review feedback. Please let me know if any additional changes are needed.

  • Move pod related logic inside cache.go
  • Add basic UTs (add fake and testify libs)

@k82cn
Contributor

k82cn commented Jan 10, 2019

@Jeffwan, would you help to squash these into only 2 commits: major logic and vendor?

@Jeffwan Jeffwan force-pushed the pod_condition branch 2 times, most recently from 2ef97c8 to 981868c Compare January 11, 2019 06:20
@Jeffwan
Contributor Author

Jeffwan commented Jan 11, 2019

@k82cn Yeah, definitely. I rearranged the commits into two: one for the code change and the other for vendor dependencies.

@Jeffwan
Contributor Author

Jeffwan commented Jan 12, 2019

Looks like CI has updates that have not been reflected on GitHub. @k82cn, do you have any idea about this problem?

jobErrMsg := job.FitError()

// Update podCondition for tasks Allocated and Pending before job discarded
for _, taskInfo := range job.TaskStatusIndex[api.Pending] {
Contributor Author

@Jeffwan Jeffwan Jan 12, 2019

@k82cn Thanks for the careful examination. I've cleaned it up; please have a look at this revision.

Moved the task status update into Backoff; Allocated tasks are included now.
The reason I use jobErrMsg and FailedSchedulingEvent here is to keep the same message as the default scheduler, in case some components rely on it.

I'll leave ssn.TaskUnschedulable public for now, in case something outside uses it to update status.

Contributor

my pleasure :)

@Jeffwan
Contributor Author

Jeffwan commented Jan 13, 2019

Resolved vendor conflicts from #551.

@k82cn
Contributor

k82cn commented Jan 13, 2019

/lgtm
/approve

@k8s-ci-robot added the lgtm label on Jan 13, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Jeffwan, k82cn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Jan 13, 2019
@k8s-ci-robot k8s-ci-robot merged commit 2e4c9e3 into kubernetes-retired:master Jan 13, 2019
@k82cn k82cn added this to the v0.4 milestone Jan 26, 2019
kevin-wangzefeng pushed a commit to kevin-wangzefeng/scheduler that referenced this pull request Jun 28, 2019
Add Unschedulable PodCondition for pods in pending