Gang scheduling job with high-priority not preempting lower priority jobs #2337

talcoh2x · 2022-07-07T07:20:44Z

We run a gang-scheduled job with high priority, but when there are not enough resources we don't see the default-priority jobs being released.

Expected:
In such cases we expect the lower-priority jobs to be deleted.

Volcano version 1.6.0
K8s version 1.22/1.21

william-wang · 2022-07-07T07:45:17Z

Thanks for reporting. This looks like the same issue as #2034; we are working on it.

william-wang · 2022-07-07T07:47:17Z

@talcoh2x Could you provide the job YAML?

snirkop89 · 2022-07-07T10:52:01Z

# High priority job
apiVersion: v1
items:
- apiVersion: kubeflow.org/v1
  kind: MPIJob
  metadata:
    creationTimestamp: "2022-07-07T10:33:13Z"
    generation: 1
    name: high-priority-mpijob
    namespace: app
    resourceVersion: "5940716"
    uid: 67724b4f-84d0-473d-bc62-317bf686fa90
  spec:
    mpiReplicaSpecs:
      Launcher:
        replicas: 1
        template:
          metadata:
            creationTimestamp: null
            name: high-priority-mpijob
          spec:
            affinity:
              podAntiAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                    - key: is-vmi
                      operator: In
                      values:
                      - "true"
                  topologyKey: kubernetes.io/hostname
            containers:
            - args:
              - sleep 1d
              # command shortened
              command:
              - mpirun
              - ...
              image: goodimage
              name: high-priority-mpijob
              resources:
                limits:
                  cpu: "2"
                  memory: 2Gi
                requests:
                  cpu: "2"
                  memory: 2Gi
              volumeMounts:
              - mountPath: /software
                name: software
            initContainers:
            - args:
              - mkdir -p /root/logs/launcher && ./dnswaiter
              command:
              - /bin/bash
              - -c
              image: goodimage
              imagePullPolicy: Always
              name: wait-dns
              resources:
                limits:
                  cpu: "5"
                  memory: 5Gi
                requests:
                  cpu: 100m
                  memory: 500Mi
              volumeMounts:
              - mountPath: /etc/mpi
                name: mpi-job-config
              - mountPath: /root/logs
                name: logs
              workingDir: /root
            restartPolicy: Never
      Worker:
        replicas: 1
        template:
          metadata:
            creationTimestamp: null
            name: high-priority-mpijob
          spec:
            affinity:
              podAntiAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                    - key: is-vmi
                      operator: In
                      values:
                      - "true"
                  topologyKey: kubernetes.io/hostname
            containers:
            - image: goodimage
              name: high-priority-mpijob
              resources:
                limits:
                  cpu: "50"
                  habana.ai/gaudi: "4" # gpu
                  hugepages-2Mi: 100000Mi
              securityContext:
                privileged: true
              volumeMounts:
                # some mounts
              workingDir: /root
            initContainers:
            - args:
              - mkdir -p /root/logs/$HOSTNAME
              command:
              - /bin/bash
              - -c
              env:
              - name: DRIVER_WITH_NETWORK
                value: "false"
              image: goodimage
              imagePullPolicy: IfNotPresent
              name: prepare-node
              resources:
                limits:
                  cpu: "5"
                  memory: 5Gi
                requests:
                  cpu: "5"
                  memory: 5Gi
              securityContext:
                privileged: false
              volumeMounts:
              # mounts
              workingDir: /root
            priorityClassName: high
            schedulerName: volcano
            volumes:
            # some volumes
    runPolicy:
      backoffLimit: 0
      cleanPodPolicy: All
      ttlSecondsAfterFinished: 300
    slotsPerWorker: 8
  status:
    conditions:
    - lastTransitionTime: "2022-07-07T10:33:13Z"
      lastUpdateTime: "2022-07-07T10:33:13Z"
      message: MPIJob xxxx is created.
      reason: MPIJobCreated
      status: "True"
      type: Created
    replicaStatuses:
      Launcher: {}
      Worker: {}
    startTime: "2022-07-07T10:33:13Z"

# No priority jobs
- apiVersion: kubeflow.org/v1
  kind: MPIJob
  metadata:
    creationTimestamp: "2022-07-07T10:32:17Z"
    generation: 1
    name: no-priority-mpijob
    namespace: app
    resourceVersion: "5940444"
    uid: 9359f6ef-1bde-427d-b4bf-74a86fe3467a
  spec:
    mpiReplicaSpecs:
      Launcher:
        replicas: 1
        template:
          metadata:
            creationTimestamp: null
            name: no-priority-mpijob
          spec:
            affinity:
              podAntiAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                    - key: is-vmi
                      operator: In
                      values:
                      - "true"
                  topologyKey: kubernetes.io/hostname
            containers:
            - args:
              - sleep 1d
              command:
              - mpirun
              - --allow-run-as-root
              # ....
              image: goodimage
              name: no-priority-mpijob
              resources:
                limits:
                  cpu: "2"
                  memory: 2Gi
                requests:
                  cpu: "2"
                  memory: 2Gi
              volumeMounts:
              - mountPath: /software
                name: software
            initContainers:
            - args:
              - mkdir -p /root/logs/launcher && ./dnswaiter
              command:
              - /bin/bash
              - -c
              image: goodimage
              imagePullPolicy: Always
              name: wait-dns
              resources:
                limits:
                  cpu: "5"
                  memory: 5Gi
                requests:
                  cpu: 100m
                  memory: 500Mi
              volumeMounts:
              #mounts
              workingDir: /root
            restartPolicy: Never
            volumes:
            # some volumes
      Worker:
        replicas: 1
        template:
          metadata:
            creationTimestamp: null
            name: no-priority-mpijob
          spec:
            affinity:
              podAntiAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                    - key: is-vmi
                      operator: In
                      values:
                      - "true"
                  topologyKey: kubernetes.io/hostname
            containers:
            - image: goodimage
              name: no-priority-mpijob
              resources:
                limits:
                  cpu: "50"
                  habana.ai/gaudi: "4" # gpu
                  hugepages-2Mi: 100000Mi
              securityContext:
                privileged: true
              volumeMounts:
              #mounts
              workingDir: /root
            initContainers:
            - args:
              - mkdir -p /root/logs/$HOSTNAME
              command:
              - /bin/bash
              - -c
              env:
              - name: DRIVER_WITH_NETWORK
                value: "false"
              image: goodimage
              imagePullPolicy: IfNotPresent
              name: prepare-node
              resources:
                limits:
                  cpu: "5"
                  memory: 5Gi
                requests:
                  cpu: "5"
                  memory: 5Gi
              securityContext:
                privileged: false
              volumeMounts:
              # mounts
              workingDir: /root
            schedulerName: volcano
            volumes:
            # volumes
    runPolicy:
      backoffLimit: 0
      cleanPodPolicy: All
      ttlSecondsAfterFinished: 300
    slotsPerWorker: 8
  status:
    conditions:
    - lastTransitionTime: "2022-07-07T10:32:17Z"
      lastUpdateTime: "2022-07-07T10:32:17Z"
      message: MPIJob xxx is created.
      reason: MPIJobCreated
      status: "True"
      type: Created
    replicaStatuses:
      Launcher: {}
      Worker:
        active: 1
    startTime: "2022-07-07T10:32:17Z"
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

william-wang · 2022-07-11T06:25:08Z

/assign @waiterQ

volcano-sh-bot · 2022-07-11T06:25:11Z

@william-wang: GitHub didn't allow me to assign the following users: waiterQ.

Note that only volcano-sh members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.

Thor-wl · 2022-07-12T09:06:50Z

@talcoh2x @snirkop89 I'm also testing this bug. Can you provide your scheduler configuration?

snirkop89 · 2022-07-17T09:18:32Z

Sure.
I tried two variations (found them in the issue here); both gave the same result: preemption doesn't occur.

apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill, preempt"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: volcano
    meta.helm.sh/release-namespace: volcano-system
  creationTimestamp: "2022-06-29T11:24:44Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: volcano-scheduler-configmap
  namespace: volcano-system
  resourceVersion: "5116689"
  uid: a60fff7a-6da2-4f7b-922c-da3447fae82f

apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill, preempt"
    tiers:
    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
        enablePreemptable: false
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: volcano
    meta.helm.sh/release-namespace: volcano-system
  creationTimestamp: "2022-06-29T11:24:44Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: volcano-scheduler-configmap
  namespace: volcano-system
  resourceVersion: "5116689"
  uid: a60fff7a-6da2-4f7b-922c-da3447fae82f

Thor-wl · 2022-07-19T01:45:56Z

@snirkop89 Hi, Snir. I've looked into the bug, and preemption is indeed broken. There are several reasons. First, the podgroup of the high-priority job cannot move from pending to inqueue, so the job has no chance to get resources. You can configure the scheduler as follows to disable the jobEnqueued functions:

    actions: "enqueue, allocate, backfill, preempt"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: overcommit
        enableJobEnqueued: false  ## disable jobEnqueued function for overcommit plugin 
      - name: drf
      - name: predicates
      - name: proportion
        enableJobEnqueued: false  ## disable jobEnqueued function for proportion plugin
      - name: nodeorder
      - name: binpack

In my local test, this lets the high-priority podgroup enter the inqueue status, but preemption still does not work. I'll give more feedback as soon as the root cause is found.
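The effect of enableJobEnqueued: false in the configuration above can be pictured with a toy model. This is a minimal sketch, not Volcano's actual code: the plugin type, the vote constants, and the assumption that overcommit and proportion reject the job on a full cluster are all hypothetical.

```go
package main

import "fmt"

// Vote values are an assumption of this sketch (reject / abstain / permit).
const (
	reject  = -1
	abstain = 0
	permit  = 1
)

// plugin is a hypothetical stand-in for a scheduler plugin's enqueue hook.
type plugin struct {
	name              string
	enableJobEnqueued bool // models the enableJobEnqueued switch
	vote              func(job string) int
}

// jobEnqueueable walks the tiers in order. One reject blocks the job, a
// permit inside a tier admits it, and plugins whose hook is disabled or
// missing are skipped. If nobody votes, the job is admitted.
func jobEnqueueable(tiers [][]plugin, job string) bool {
	for _, tier := range tiers {
		permitted := false
		for _, p := range tier {
			if !p.enableJobEnqueued || p.vote == nil {
				continue
			}
			v := p.vote(job)
			if v < 0 {
				return false
			}
			if v > 0 {
				permitted = true
			}
		}
		if permitted {
			return true
		}
	}
	return true
}

// buildTiers mirrors the two tiers of the config. The reject votes model a
// cluster with no free resources (an assumption, not measured behaviour).
func buildTiers(enqueueEnabled bool) [][]plugin {
	rejectAll := func(string) int { return reject }
	return [][]plugin{
		{{name: "priority"}, {name: "gang"}, {name: "conformance"}},
		{
			{name: "overcommit", enableJobEnqueued: enqueueEnabled, vote: rejectAll},
			{name: "proportion", enableJobEnqueued: enqueueEnabled, vote: rejectAll},
		},
	}
}

func main() {
	fmt.Println(jobEnqueueable(buildTiers(true), "high-priority-mpijob"))  // false
	fmt.Println(jobEnqueueable(buildTiers(false), "high-priority-mpijob")) // true
}
```

In this model, switching the hook off removes the two rejecting votes, which is why the podgroup can reach inqueue even though nothing about the preempt action itself has changed.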

zhypku · 2022-07-21T06:52:40Z

@Thor-wl Hi, I'm also studying the preemption behavior of Volcano and found the same problem. It seems that the JobStarvingFn of the gang plugin prevents a job from preempting when ji.CheckTaskMinAvailablePipelined() returns false for it.

I found this line in the scheduler's log (in my test, the job default/priority-job has a higher priority but is waiting):
I0721 03:40:52.353591 1 job_info.go:773] Job default/priority-job Task default-nginx occupied 0 less than task min avaliable
Then I disabled the JobStarvingFn of the gang plugin by setting enableJobStarving: false, and preemption worked. So is this by design or a bug? Why does a false return value from CheckTaskMinAvailablePipelined prohibit preemption?
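To make the question concrete, here is a minimal sketch of the gate as I understand it. It is a toy model under stated assumptions, not the scheduler's code: the job type, the occupied < minAvailable condition, and treating a nil check as "gate disabled" are all hypothetical.

```go
package main

import "fmt"

// job is a hypothetical, simplified stand-in for the scheduler's job info.
type job struct {
	name         string
	minAvailable int  // gang size
	occupied     int  // tasks already running or holding resources
	pipelined    bool // models the result of CheckTaskMinAvailablePipelined
}

// gangStarving models the behaviour described above: a job only counts as
// starving (and may therefore preempt) when the pipelined check passes.
func gangStarving(j job) bool {
	return j.pipelined && j.occupied < j.minAvailable
}

// preemptors returns the pending jobs that are allowed to act as preemptors.
// In this sketch a nil starving function models enableJobStarving: false.
func preemptors(pending []job, starving func(job) bool) []string {
	var out []string
	for _, j := range pending {
		if starving != nil && !starving(j) {
			continue // the gate filters the job out before any victim is considered
		}
		out = append(out, j.name)
	}
	return out
}

func main() {
	// The waiting job occupies 0 tasks, so the pipelined check is false.
	pending := []job{{name: "default/priority-job", minAvailable: 1}}

	fmt.Println(preemptors(pending, gangStarving)) // []
	fmt.Println(preemptors(pending, nil))          // [default/priority-job]
}
```

The circularity is visible here: a job that holds nothing fails the check, and failing the check is what stops it from ever obtaining anything through preemption.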

Thor-wl · 2022-07-21T07:12:21Z

@zhypku Thanks for the feedback. That's what I also found yesterday. IMO, this is not the expected behavior. I'm tracking down which commit introduced it and when.

HecarimV · 2022-07-25T09:45:40Z

volcano/pkg/scheduler/actions/preempt/preempt.go

Lines 124 to 126 in 1b26306

    if !task.Preemptable {
        return false
    }

Pods with Preemptable = false will not be preempted, but it seems that task.Preemptable is false by default when neither the annotation nor the label is set.

volcano/pkg/scheduler/api/pod_info.go

Line 101 in 1b26306

    return false

@Thor-wl I don't know if this could be the problem.
The similar reclaim action may have the same problem, see #2340:

volcano/pkg/scheduler/actions/reclaim/reclaim.go

Lines 135 to 137 in 1b26306

    if !task.Preemptable {
        continue
    }
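As a rough illustration of why this matters, the sketch below models a victim filter in which the flag defaults to Go's zero value. The task type and the victims function are hypothetical and only mirror the shape of the checks quoted above; they are not the scheduler's real types.

```go
package main

import "fmt"

// task is a hypothetical, simplified stand-in for a running task.
type task struct {
	name        string
	priority    int
	preemptable bool // zero value is false, like the default discussed above
}

// victims returns the running tasks a preemptor may evict: tasks that are
// marked preemptable and have a strictly lower priority than the preemptor.
func victims(running []task, preemptorPriority int) []string {
	var out []string
	for _, t := range running {
		if !t.preemptable {
			continue // same shape as the checks in preempt.go and reclaim.go
		}
		if t.priority >= preemptorPriority {
			continue
		}
		out = append(out, t.name)
	}
	return out
}

func main() {
	// No annotation or label was set, so the flag keeps its default.
	unmarked := []task{{name: "no-priority-mpijob-worker-0", priority: 0}}
	fmt.Println(len(victims(unmarked, 1000))) // 0

	// The same task, explicitly marked preemptable.
	marked := []task{{name: "no-priority-mpijob-worker-0", priority: 0, preemptable: true}}
	fmt.Println(victims(marked, 1000)) // [no-priority-mpijob-worker-0]
}
```

With a false default, a high-priority job finds no victims at all unless every lower-priority pod has been marked explicitly.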

Thor-wl · 2022-07-26T01:02:49Z

To stay compatible with earlier versions, task.Preemptable should be true by default. I've traced the commits, and the default value of false was introduced here:

volcano/pkg/scheduler/api/pod_info.go

Line 76 in 2bb5ac7

func GetPodPreemptable(pod *v1.Pod) bool {

@wpeng102 It seems that the TDM plugin introduced this behavior. Let's review it. Thanks!
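One way to picture the compatible behavior is a lookup with an explicit default. This is a minimal sketch, not a proposed patch: podPreemptable is a hypothetical helper, plain maps stand in for the pod's annotations and labels, and the key name is an assumption that should be checked against the Volcano API package.

```go
package main

import (
	"fmt"
	"strconv"
)

// preemptableKey is assumed for this sketch; verify the real key before use.
const preemptableKey = "volcano.sh/preemptable"

// podPreemptable checks the annotations first, then the labels, and falls
// back to def when neither carries a parsable boolean. Ignoring unparsable
// values is a choice made for this sketch only.
func podPreemptable(annotations, labels map[string]string, def bool) bool {
	for _, m := range []map[string]string{annotations, labels} {
		if v, ok := m[preemptableKey]; ok {
			if b, err := strconv.ParseBool(v); err == nil {
				return b
			}
		}
	}
	return def
}

func main() {
	// Unmarked pod: the result is whatever the default is.
	fmt.Println(podPreemptable(nil, nil, false)) // false
	fmt.Println(podPreemptable(nil, nil, true))  // true

	// An explicit opt-out still wins over a true default.
	optOut := map[string]string{preemptableKey: "false"}
	fmt.Println(podPreemptable(optOut, nil, true)) // false
}
```

With a true default, unmarked pods stay preemptable as in earlier versions, while pods that need protection can still opt out explicitly.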

snirkop89 · 2022-07-27T10:47:25Z

That's great to hear. Thank you for the fast response and feedback.

Thor-wl · 2022-07-28T01:28:37Z

No worries. The fix is under discussion.

snirkop89 · 2022-09-08T10:42:47Z

Hi, is there an update about this?

talcoh2x · 2022-09-29T12:02:55Z

@william-wang @Thor-wl
Hi guys, do you have any update? We are really stuck and need help.

talcoh2x · 2022-10-19T07:37:24Z

Hi, is there anything new to report?

talcoh2x · 2022-10-19T07:41:22Z

@zhypku Hi, can you share the Volcano configuration that worked for you? I mean the one for the preemption flow.

stale · 2023-01-21T10:47:46Z

Hello 👋 Looks like there was no activity on this issue for last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale · 2023-03-23T04:12:24Z

Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗

talcoh2x added the kind/bug Categorizes issue or PR as related to a bug. label Jul 7, 2022

Thor-wl self-assigned this Jul 12, 2022

waiterQ mentioned this issue Jul 15, 2022

Request to be an member of Volcano community volcano-sh/community#30

Closed

Thor-wl assigned wpeng102 and Thor-wl and unassigned Thor-wl Jul 26, 2022

Thor-wl added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jul 26, 2022

RamezesDong mentioned this issue Nov 2, 2022

added priority capability for reclaim action #2262

Open

stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 21, 2023

stale bot closed this as completed Mar 23, 2023
