
Gang scheduling job with high-priority not preempting lower priority jobs #2337

Closed
talcoh2x opened this issue Jul 7, 2022 · 20 comments
Labels: kind/bug, lifecycle/stale, priority/important-soon

Comments

talcoh2x commented Jul 7, 2022

We run a gang-scheduled job with high priority, but the default-priority jobs are not released once there are not enough resources.

Expected:
In such cases we expect the lower-priority jobs to be deleted (preempted).

Volcano version 1.6.0
K8s version 1.22/1.21
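
For context, a minimal sketch of the kind of PriorityClass assumed by this report (the name high matches the priorityClassName used in the job YAML below; the value is purely illustrative):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high            # referenced as priorityClassName: high in the high-priority job below
value: 1000000          # illustrative; only needs to be higher than the default-priority class
globalDefault: false
description: "Jobs in this class are expected to preempt default-priority workloads."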

talcoh2x added the kind/bug label Jul 7, 2022
william-wang (Member) commented Jul 7, 2022

Thanks for reporting. It seems to be the same issue as #2034; we are dealing with it.

@william-wang (Member)

@talcoh2x Would you like to provide the job YAML?

@snirkop89

# High priority job
apiVersion: v1
items:
- apiVersion: kubeflow.org/v1
  kind: MPIJob
  metadata:
    creationTimestamp: "2022-07-07T10:33:13Z"
    generation: 1
    name: high-priority-mpijob
    namespace: app
    resourceVersion: "5940716"
    uid: 67724b4f-84d0-473d-bc62-317bf686fa90
  spec:
    mpiReplicaSpecs:
      Launcher:
        replicas: 1
        template:
          metadata:
            creationTimestamp: null
            name: high-priority-mpijob
          spec:
            affinity:
              podAntiAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                    - key: is-vmi
                      operator: In
                      values:
                      - "true"
                  topologyKey: kubernetes.io/hostname
            containers:
            - args:
              - sleep 1d
              # command shortened
              command:
              - mpirun
              - ...
              image: goodimage
              name: high-priority-mpijob
              resources:
                limits:
                  cpu: "2"
                  memory: 2Gi
                requests:
                  cpu: "2"
                  memory: 2Gi
              volumeMounts:
              - mountPath: /software
                name: software
            initContainers:
            - args:
              - mkdir -p /root/logs/launcher && ./dnswaiter
              command:
              - /bin/bash
              - -c
              image: goodimage
              imagePullPolicy: Always
              name: wait-dns
              resources:
                limits:
                  cpu: "5"
                  memory: 5Gi
                requests:
                  cpu: 100m
                  memory: 500Mi
              volumeMounts:
              - mountPath: /etc/mpi
                name: mpi-job-config
              - mountPath: /root/logs
                name: logs
              workingDir: /root
            restartPolicy: Never
      Worker:
        replicas: 1
        template:
          metadata:
            creationTimestamp: null
            name: high-priority-mpijob
          spec:
            affinity:
              podAntiAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                    - key: is-vmi
                      operator: In
                      values:
                      - "true"
                  topologyKey: kubernetes.io/hostname
            containers:
            - image: goodimage
              name: high-priority-mpijob
              resources:
                limits:
                  cpu: "50"
                  habana.ai/gaudi: "4" # gpu
                  hugepages-2Mi: 100000Mi
              securityContext:
                privileged: true
              volumeMounts:
                # some mounts
              workingDir: /root
            initContainers:
            - args:
              - mkdir -p /root/logs/$HOSTNAME
              command:
              - /bin/bash
              - -c
              env:
              - name: DRIVER_WITH_NETWORK
                value: "false"
              image: goodimage
              imagePullPolicy: IfNotPresent
              name: prepare-node
              resources:
                limits:
                  cpu: "5"
                  memory: 5Gi
                requests:
                  cpu: "5"
                  memory: 5Gi
              securityContext:
                privileged: false
              volumeMounts:
              # mounts
              workingDir: /root
            priorityClassName: high
            schedulerName: volcano
            volumes:
            # some volumes
    runPolicy:
      backoffLimit: 0
      cleanPodPolicy: All
      ttlSecondsAfterFinished: 300
    slotsPerWorker: 8
  status:
    conditions:
    - lastTransitionTime: "2022-07-07T10:33:13Z"
      lastUpdateTime: "2022-07-07T10:33:13Z"
      message: MPIJob xxxx is created.
      reason: MPIJobCreated
      status: "True"
      type: Created
    replicaStatuses:
      Launcher: {}
      Worker: {}
    startTime: "2022-07-07T10:33:13Z"

# No priority jobs
- apiVersion: kubeflow.org/v1
  kind: MPIJob
  metadata:
    creationTimestamp: "2022-07-07T10:32:17Z"
    generation: 1
    name: no-priority-mpijob
    namespace: app
    resourceVersion: "5940444"
    uid: 9359f6ef-1bde-427d-b4bf-74a86fe3467a
  spec:
    mpiReplicaSpecs:
      Launcher:
        replicas: 1
        template:
          metadata:
            creationTimestamp: null
            name: no-priority-mpijob
          spec:
            affinity:
              podAntiAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                    - key: is-vmi
                      operator: In
                      values:
                      - "true"
                  topologyKey: kubernetes.io/hostname
            containers:
            - args:
              - sleep 1d
              command:
              - mpirun
              - --allow-run-as-root
              # ....
              image: goodimage
              name: no-priority-mpijob
              resources:
                limits:
                  cpu: "2"
                  memory: 2Gi
                requests:
                  cpu: "2"
                  memory: 2Gi
              volumeMounts:
              - mountPath: /software
                name: software
            initContainers:
            - args:
              - mkdir -p /root/logs/launcher && ./dnswaiter
              command:
              - /bin/bash
              - -c
              image: goodimage
              imagePullPolicy: Always
              name: wait-dns
              resources:
                limits:
                  cpu: "5"
                  memory: 5Gi
                requests:
                  cpu: 100m
                  memory: 500Mi
              volumeMounts:
              #mounts
              workingDir: /root
            restartPolicy: Never
            volumes:
            # some volumes
      Worker:
        replicas: 1
        template:
          metadata:
            creationTimestamp: null
            name: no-priority-mpijob
          spec:
            affinity:
              podAntiAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                    - key: is-vmi
                      operator: In
                      values:
                      - "true"
                  topologyKey: kubernetes.io/hostname
            containers:
            - image: goodimage
              name: no-priority-mpijob
              resources:
                limits:
                  cpu: "50"
                  habana.ai/gaudi: "4" # gpu
                  hugepages-2Mi: 100000Mi
              securityContext:
                privileged: true
              volumeMounts:
              #mounts
              workingDir: /root
            initContainers:
            - args:
              - mkdir -p /root/logs/$HOSTNAME
              command:
              - /bin/bash
              - -c
              env:
              - name: DRIVER_WITH_NETWORK
                value: "false"
              image: goodimage
              imagePullPolicy: IfNotPresent
              name: prepare-node
              resources:
                limits:
                  cpu: "5"
                  memory: 5Gi
                requests:
                  cpu: "5"
                  memory: 5Gi
              securityContext:
                privileged: false
              volumeMounts:
              # mounts
              workingDir: /root
            schedulerName: volcano
            volumes:
            # volumes
    runPolicy:
      backoffLimit: 0
      cleanPodPolicy: All
      ttlSecondsAfterFinished: 300
    slotsPerWorker: 8
  status:
    conditions:
    - lastTransitionTime: "2022-07-07T10:32:17Z"
      lastUpdateTime: "2022-07-07T10:32:17Z"
      message: MPIJob xxx is created.
      reason: MPIJobCreated
      status: "True"
      type: Created
    replicaStatuses:
      Launcher: {}
      Worker:
        active: 1
    startTime: "2022-07-07T10:32:17Z"
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

@william-wang (Member)

/assign @waiterQ

@volcano-sh-bot (Contributor)

@william-wang: GitHub didn't allow me to assign the following users: waiterQ.

Note that only volcano-sh members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @waiterQ

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Thor-wl (Contributor) commented Jul 12, 2022

@talcoh2x @snirkop89 I'm also running a test for this bug. Could you provide your scheduler configuration?

@snirkop89

Sure.
I tried a few variations (found in issues here); both yielded the same result: preemption does not occur.

apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill, preempt"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: volcano
    meta.helm.sh/release-namespace: volcano-system
  creationTimestamp: "2022-06-29T11:24:44Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: volcano-scheduler-configmap
  namespace: volcano-system
  resourceVersion: "5116689"
  uid: a60fff7a-6da2-4f7b-922c-da3447fae82f

apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill, preempt"
    tiers:
    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
        enablePreemptable: false
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: volcano
    meta.helm.sh/release-namespace: volcano-system
  creationTimestamp: "2022-06-29T11:24:44Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: volcano-scheduler-configmap
  namespace: volcano-system
  resourceVersion: "5116689"
  uid: a60fff7a-6da2-4f7b-922c-da3447fae82f

Thor-wl (Contributor) commented Jul 19, 2022

@snirkop89 Hi, Snir. I've taken a look at the bug, and preemption was indeed broken. There are several reasons for that. Firstly, the podgroup for the high-priority job cannot transition from Pending to Inqueue, so the job has no chance to get resources. You can configure the scheduler as follows to disable the jobEnqueued functions.

    actions: "enqueue, allocate, backfill, preempt"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: overcommit
        enableJobEnqueued: false  ## disable jobEnqueued function for overcommit plugin 
      - name: drf
      - name: predicates
      - name: proportion
        enableJobEnqueued: false  ## disable jobEnqueued function for proportion plugin
      - name: nodeorder
      - name: binpack

From what I tested locally, this lets the high-priority podgroup enter the Inqueue status, but preemption was still not working. I'll give more feedback as soon as the root cause is found.

zhypku commented Jul 21, 2022

@Thor-wl Hi, I'm also studying the preemption behavior of Volcano and found the same problem. It seems that the JobStarvingFn of the gang plugin forbids preemption on behalf of a job for which ji.CheckTaskMinAvailablePipelined() returns false.
[screenshots of the gang plugin's JobStarvingFn and the job_info.go check]
I did find this line in the scheduler's log (in my test, the job default/priority-job has a higher priority but is waiting):
I0721 03:40:52.353591 1 job_info.go:773] Job default/priority-job Task default-nginx occupied 0 less than task min avaliable
Then I disabled the JobStarvingFn of the gang plugin by setting enableJobStarving: false, and preemption worked. So is this a by-design feature or a bug? Why does a false return value of CheckTaskMinAvailablePipelined prohibit preemption?
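
For reference, a minimal sketch of the scheduler configuration change described here, assuming the same tier layout snirkop89 posted above (only the enableJobStarving flag on the gang plugin is new):

    actions: "enqueue, allocate, backfill, preempt"
    tiers:
    - plugins:
      - name: priority
      - name: gang
        enableJobStarving: false  ## disable the gang plugin's JobStarvingFn so it no longer blocks preemption
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack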

Thor-wl (Contributor) commented Jul 21, 2022


Thanks for the feedback. That's what I also found yesterday. IMO, it's not the expected behavior. I'm tracking down which commit introduced this behavior and when.

HecarimV (Contributor) commented Jul 25, 2022

if !task.Preemptable {
    return false
}

Pods with Preemptable = false will not be preempted, but it seems that task.Preemptable is false by default if we don't set the annotation or label.

@Thor-wl I don't know if this could be the problem.
The similar reclaim action may have the same problem as well; see #2340:

if !task.Preemptable {
    continue
}
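
As an aside, a minimal sketch of how a workload can be marked preemptable explicitly, assuming the volcano.sh/preemptable annotation is the key read by GetPodPreemptable (the pod name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: low-priority-worker         # placeholder name
  annotations:
    volcano.sh/preemptable: "true"  # assumed annotation key; marks this pod as preemptable
spec:
  schedulerName: volcano
  containers:
  - name: worker
    image: goodimage                # placeholder image, matching the examples above
    resources:
      requests:
        cpu: "1"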

Thor-wl (Contributor) commented Jul 26, 2022


To keep compatibility with former versions, task.Preemptable should be true by default. I've tracked the commits, and the default value of false was introduced here:

func GetPodPreemptable(pod *v1.Pod) bool {

@wpeng102 It seems that the TDM plugin introduced this behavior. Let's take a review. Thanks!

Thor-wl assigned wpeng102 and Thor-wl and unassigned Thor-wl Jul 26, 2022
Thor-wl added the priority/important-soon label Jul 26, 2022
@snirkop89

That's great to hear. Thank you for the fast response and feedback.

Thor-wl (Contributor) commented Jul 28, 2022


No worries. The fix is under discussion.

@snirkop89

Hi, is there an update about this?

@talcoh2x (Author)

@william-wang @Thor-wl
Hi guys, is there any update? We are really stuck and need help.

@talcoh2x (Author)

Hi, is there anything new to report?

@talcoh2x (Author)

@zhypku Hi, can you share the Volcano configuration that worked for you for the preemption flow?

stale bot commented Jan 21, 2023

Hello 👋 It looks like there has been no activity on this issue for the last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity for another 60 days, this issue will be closed (we can always reopen an issue if we need to!).

stale bot added the lifecycle/stale label Jan 21, 2023
stale bot commented Mar 23, 2023

Closing for now, as there was no activity for the last 60 days after being marked as stale. Let us know if you need this to be reopened! 🤗

stale bot closed this as completed Mar 23, 2023