Smaller jobs are scheduled ahead of higher priority jobs #2052

Closed · tFable opened this issue Mar 4, 2022 · 20 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

tFable commented Mar 4, 2022

What happened:
For queued jobs in a single queue and a single namespace, jobs requesting fewer resources are being favored ahead of jobs with a higher value in their priorityClassName.

When all queued jobs have the same resource requests, the priorityClassName is honored (jobs with a higher priorityClassName value are executed ahead of jobs with a lower one).

However, when resource requests differ, jobs requesting fewer resources are favored over jobs with a higher priorityClassName value.

I'm relatively new to Volcano, but I presume this means the DRF algorithm is applied before the priorityClassName is evaluated?

Please note that we are not looking for preemption/eviction of running pods (the preempt action is disabled in the configmap, as you'll see below).

What you expected to happen:
Jobs with a higher-value priorityClassName should be scheduled ahead of all other jobs in the same queue and namespace, regardless of each job's resource requests.

How to reproduce it (as minimally and precisely as possible):

  • Create the following two priority classes
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 5000
globalDefault: false
description: "Highest Priority Jobs"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: reg-priority
value: 1000
globalDefault: false
description: "Regular Priority Jobs"
  • For simplicity, deploy a vcjob that takes up the entire cluster (for me, that's 35 pods) and assign it priorityClassName: reg-priority. The job will start running. (The task replicas match the job-level minAvailable.)
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: test-job1
spec:
  priorityClassName: reg-priority
  minAvailable: 35
  schedulerName: volcano
  ...
  • Deploy one or more jobs with small resource requests, still using reg-priority. (The task replicas match the minAvailable.)
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: test-job2
spec:
  priorityClassName: reg-priority
  minAvailable: 12
  schedulerName: volcano
  ...

Now, test-job2 is pending since test-job1 is taking up the entire cluster.

  • Deploy a 3rd job that asks for the entire cluster but, this time, apply a higher priority to the job
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: test-job3
spec:
  priorityClassName: high-priority
  minAvailable: 35
  schedulerName: volcano
  • Delete test-job1
  • Watch test-job2 get executed while test-job3 remains pending

Please note that I have verified this with multiple jobs to ensure that FIFO is not just kicking in. Consistently, the jobs with fewer resources are being scheduled ahead even if other jobs have a higher priority class.
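
For reference, a rough way to observe which job gets picked up after test-job1 is deleted (the vcjob/podgroup resource names and the test namespace are assumptions based on a default Volcano install and the scheduler logs quoted in later comments; adjust as needed):

# Watch job and podgroup phases (Pending / Inqueue / Running)
kubectl get vcjob -n test -w
kubectl get podgroups.scheduling.volcano.sh -n test
# See which job's pods actually land on nodes
kubectl get pods -n test -o wide
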
Anything else we need to know?:
The configmaps look like this:

kubectl get configmaps -n volcano-system volcano-scheduler-configmap -oyaml
apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"volcano-scheduler.conf":"actions: \"enqueue, allocate, backfill\"\ntiers:\n- plugins:\n  - name: priority\n  - name: gang\n  - name: conformance\n- plugins:\n  - name: overcommit\n  - name: drf\n  - name: predicates\n  - name: proportion\n  - name: nodeorder\n  - name: binpack\n"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"volcano-scheduler-configmap","namespace":"volcano-system"}}
  creationTimestamp: "2022-03-04T12:00:25Z"
  name: volcano-scheduler-configmap
  namespace: volcano-system
  resourceVersion: "1715819"

and

kubectl get configmaps -n volcano-system volcano-admission-configmap -oyaml 
apiVersion: v1
data:
  volcano-admission.conf: |
    #resourceGroups:
    #- resourceGroup: management                    # set the resource group name
    #  object:
    #    key: namespace                             # set the field and the value to be matched
    #    value:
    #    - mng-ns-1
    #  schedulerName: default-scheduler             # set the scheduler for patching
    #  tolerations:                                 # set the tolerations for patching
    #  - effect: NoSchedule
    #    key: taint
    #    operator: Exists
    #  labels:
    #    volcano.sh/nodetype: management           # set the nodeSelector for patching
    #- resourceGroup: cpu
    #  object:
    #    key: annotation
    #    value:
    #    - "volcano.sh/resource-group: cpu"
    #  schedulerName: volcano
    #  labels:
    #    volcano.sh/nodetype: cpu
    #- resourceGroup: gpu                          # if the object is unsetted, default is:  the key is annotation,
    #  schedulerName: volcano                      # the annotation key is fixed and is "volcano.sh/resource-group", The corresponding value is the resourceGroup field
    #  labels:
    #    volcano.sh/nodetype: gpu
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"volcano-admission.conf":"#resourceGroups:\n#- resourceGroup: management                    # set the resource group name\n#  object:\n#    key: namespace                             # set the field and the value to be matched\n#    value:\n#    - mng-ns-1\n#  schedulerName: default-scheduler             # set the scheduler for patching\n#  tolerations:                                 # set the tolerations for patching\n#  - effect: NoSchedule\n#    key: taint\n#    operator: Exists\n#  labels:\n#    volcano.sh/nodetype: management           # set the nodeSelector for patching\n#- resourceGroup: cpu\n#  object:\n#    key: annotation\n#    value:\n#    - \"volcano.sh/resource-group: cpu\"\n#  schedulerName: volcano\n#  labels:\n#    volcano.sh/nodetype: cpu\n#- resourceGroup: gpu                          # if the object is unsetted, default is:  the key is annotation,\n#  schedulerName: volcano                      # the annotation key is fixed and is \"volcano.sh/resource-group\", The corresponding value is the resourceGroup field\n#  labels:\n#    volcano.sh/nodetype: gpu\n"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"volcano-admission-configmap","namespace":"volcano-system"}}
  creationTimestamp: "2022-03-04T12:00:24Z"
  name: volcano-admission-configmap
  namespace: volcano-system
  resourceVersion: "1715769"
  uid: 6b364011-0d92-452d-b6c6-93242ac67cbb

queue config

kubectl -n volcano-system get queues.scheduling.volcano.sh q1 -oyaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"scheduling.volcano.sh/v1beta1","kind":"Queue","metadata":{"annotations":{},"name":"q1"},"spec":{"reclaimable":true,"weight":1}}
  creationTimestamp: "2022-03-04T12:00:33Z"
  generation: 1
  name: q1
  resourceVersion: "1721738"
  uid: c988a11f-8b81-4b5f-addf-290b699dda22
spec:
  reclaimable: true
  weight: 1
status:
  pending: 3
  reservation: {}
  running: 1
  state: Open

Environment:

  • Volcano Version: 1.5
  • Kubernetes version (use kubectl version): 1.23
  • Cloud provider or hardware configuration: vSphere on-prem
  • OS (e.g. from /etc/os-release):
NAME="Ubuntu"
VERSION="20.04.3 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.3 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
  • Kernel (e.g. uname -a): 5.13.0-28-generic #31~20.04.1-Ubuntu SMP Wed Jan 19 14:08:10 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Others:
tFable added the kind/bug label on Mar 4, 2022

tFable commented Mar 4, 2022

It's worth noting that the individual tasks don't have a priorityClassName defined.

Here is the entire job template with the tasks:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: test-job4
spec:
  priorityClassName: reg-priority
  minAvailable: 9
  schedulerName: volcano
  maxRetry: 5
  queue: q1
  tasks:
    - replicas: 1
      name: "test-head"
      template:
        metadata:
          labels:
            component: test-job4-ray-head
            type: ray
        spec:
          volumes:
            - name: dshm
              emptyDir:
                medium: Memory
            - name: envy
              hostPath:
                path: /envy
            - name: q
              hostPath:
                path: /q
            - name: auto
              hostPath:
                path: /auto
          containers:
            - name: ray-head
              image: ray/ray
              imagePullPolicy: IfNotPresent
              command: [ "/bin/bash", "-c", "--" ]
              args:
                - "ray start --head --port=6379 --redis-shard-ports=6380,6381 --num-cpus=$MY_CPU_REQUEST --object-manager-port=22345 --node-manager-port=22346 --dashboard-host=0.0.0.0 --block"
              ports:
                - containerPort: 6379 # Redis port
                - containerPort: 10001 # Used by Ray Client
                - containerPort: 8265 # Used by Ray Dashboard
              volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
                - mountPath: /envy
                  mountPropagation: HostToContainer
                  name: envy
                - mountPath: /q
                  mountPropagation: HostToContainer
                  name: q
                - mountPath: /auto
                  mountPropagation: HostToContainer
                  name: auto
              env:
                - name: MY_CPU_REQUEST
                  valueFrom:
                    resourceFieldRef:
                      resource: requests.cpu
              resources:
                requests:
                  cpu: 1
                  memory: 512Mi
    - replicas: 8
      name: "test-workers"
      dependsOn:
        name:
          - "test-head"
      template:
        metadata:
          name: web
        spec:
          restartPolicy: Never
          volumes:
          - name: dshm
            emptyDir:
              medium: Memory
          - name: envy
            hostPath:
              path: /envy
          - name: q
            hostPath:
              path: /q
          - name: auto
            hostPath:
              path: /auto
          containers:
           - name: ray-head
             image: ray/ray
             imagePullPolicy: Always
             command: [ "/bin/bash", "-c", "--" ]
             args:
               - "ray start --num-cpus=$MY_CPU_REQUEST --address=$TEST_JOB4_RAY_HEAD_SERVICE_HOST:$TEST_JOB4_RAY_HEAD_SERVICE_PORT_REDIS --object-manager-port=22345 --node-manager-port=22346 --block"
             ports:
               - containerPort: 6379 # Redis port
               - containerPort: 10001 # Used by Ray Client
               - containerPort: 8265 # Used by Ray Dashboard
             volumeMounts:
               - mountPath: /dev/shm
                 name: dshm
               - mountPath: /envy
                 mountPropagation: HostToContainer
                 name: envy
               - mountPath: /q
                 mountPropagation: HostToContainer
                 name: q
               - mountPath: /auto
                 mountPropagation: HostToContainer
                 name: auto

             env:
               - name: MY_CPU_REQUEST
                 valueFrom:
                   resourceFieldRef:
                     resource: requests.cpu
             resources:
               requests:
                 cpu: 1
                 memory: 512Mi
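
(If a per-task priority were ever desired, priorityClassName is a standard Kubernetes pod-spec field, so it could also be set inside a task's pod template. A minimal sketch only, reusing names from the template above; whether the Volcano scheduler honors per-task pod priorities is a separate question:)

  tasks:
    - replicas: 8
      name: "test-workers"
      template:
        spec:
          # standard pod-spec field; applies only to this task's pods
          priorityClassName: high-priority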


ldd91 commented Mar 4, 2022

You can add the preempt action and give it a try.


tFable commented Mar 4, 2022

Hi @ldd91,
Thanks for the message.
I'm specifically looking for already-running jobs/pods to not be evicted/preempted (our software doesn't support it).

Will adding the preempt action evict lower-prio jobs (or some of their pods) as soon as higher priority jobs get scheduled?

For context: The job-level minAvailable and the sum of all task replicas are equal.

Thanks again!
t


tFable commented Mar 4, 2022

FWIW, I did add the preempt action and the behavior remained the same.

Here's the full configmap (I re-installed Volcano so that the change would be picked up):

kubectl get configmaps -n volcano-system volcano-scheduler-configmap -o yaml
apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill, preempt"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"volcano-scheduler.conf":"actions: \"enqueue, allocate, backfill, preempt\"\ntiers:\n- plugins:\n  - name: priority\n  - name: gang\n  - name: conformance\n- plugins:\n  - name: overcommit\n  - name: drf\n  - name: predicates\n  - name: proportion\n  - name: nodeorder\n  - name: binpack\n"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"volcano-scheduler-configmap","namespace":"volcano-system"}}
  creationTimestamp: "2022-03-04T17:57:43Z"
  name: volcano-scheduler-configmap
  namespace: volcano-system
  resourceVersion: "1764191"
  uid: f086f7b9-af66-4e02-968a-da235f9829f6
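
(A lighter-weight alternative to re-installing, assuming the default deployment name from the Volcano install, is to edit the configmap and restart the scheduler so it reloads the configuration:)

# Edit the scheduler config in place, then restart the scheduler deployment
# ("volcano-scheduler" is assumed to be the default deployment name)
kubectl -n volcano-system edit configmap volcano-scheduler-configmap
kubectl -n volcano-system rollout restart deployment volcano-scheduler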


tFable commented Mar 5, 2022

This seems to happen even when the drf plugin is not in the list of plugins. No matter what I've tried, I'm unable to make Volcano stop preferring smaller jobs over larger ones; smaller jobs always go in first.

I'm perplexed and any help would be greatly appreciated. This is currently a blocker for us to get Volcano out of PoC.


tFable commented Mar 5, 2022

The plot thickens...

I now believe my issue is related to what the folks in #1901 are seeing.

I think the root cause is the overcommit-factor of 1.2. This is what I believe happens:

  • job1, which takes up the entire cluster, starts running (priority: regular)
  • job2, which asks for the entire cluster, is queued (priority: regular)
  • job3, which asks for a small number of pods, is queued (priority: regular)
  • job4, which asks for the entire cluster, is queued (priority: HIGH)

When job1 starts running, I also see pending pods from job3. I also see job3 in Inqueue status, whereas jobs 2 and 4 stay Pending while job1 is running.

If I rerun the scenario with job1 (entire cluster) but change job3 to a large number of pods (though not the entire cluster), then jobs 2, 3, and 4 all stay Pending and there are no pending pods (only job1's pods show up in Running state).

So what I believe happens when job3 has a small number of pods is that its podgroup is small enough to fit within the default overcommit-factor of 1.2, and thus its pods are created in Pending state while job1 is running. Because those pods already exist in Pending state, when job1 is removed, job3's pods, and NOT job4's pods (which have higher priority), start running.

I tried changing the overcommit-factor to 1.0 following the example here; however, this doesn't change the behavior.


tFable commented Mar 7, 2022

I am not able to completely disable overcommit. Even when I set

- plugins:
  - name: overcommit
    arguments:
      "overcommit-factor": 1.0

Even when job1 is designed to take up the entire cluster, if job2 is small enough it will still go Inqueue and have pending pods. It seems to me that the overcommit factor never actually drops to 1 (i.e., no overcommit).

That said, I'm seeing the behavior (small low-prio job runs ahead of large high-prio job) even when job2 is NOT small enough to fit in the overcommit.

The only way I've been able to get around this is to add an sla annotation to the higher-priority job:

annotations:
    sla-waiting-time: 2h
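
(In context, the annotation sits on the vcjob metadata, assuming the sla plugin picks it up via the job's generated PodGroup, as the snippet above suggests; a sketch using the earlier example job:)

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: test-job3
  annotations:
    sla-waiting-time: 2h   # per-job SLA; requires the sla plugin to be enabled in the scheduler tiers
spec:
  priorityClassName: high-priority
  minAvailable: 35
  schedulerName: volcano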


ldd91 commented Mar 7, 2022

Sorry, I didn't see that your scenario doesn't support preemption/eviction. I'll try to reproduce it.


ldd91 commented Mar 7, 2022

Here is my configmap:

kubectl get configmap volcano-scheduler-configmap  -n volcano-system  -o yaml
apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, preempt, reclaim, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"volcano-scheduler.conf":"actions: \"enqueue, allocate, backfill\"\ntiers:\n- plugins:\n  - name: priority\n  - name: gang\n  - name: conformance\n- plugins:\n  - name: overcommit\n  - name: drf\n  - name: predicates\n  - name: proportion\n  - name: nodeorder\n  - name: binpack\n"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"volcano-scheduler-configmap","namespace":"volcano-system"}}
  creationTimestamp: "2022-02-21T14:34:47Z"
  name: volcano-scheduler-configmap
  namespace: volcano-system
  resourceVersion: "11819858"
  uid: c5a14834-298e-47f1-b495-c36ec67eb33c

cat low-priority.yaml
apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: low-pri
value: -1
cat high-priority.yaml
apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: high-pri
preemptionPolicy: PreemptLowerPriority
value: 1000

In my scenario, the high priority job is scheduled ahead of the low priority job (the low priority job requests fewer resources and was submitted first).


ldd91 commented Mar 7, 2022

In my scenario, if the high priority job requests more resources than are idle in the cluster, the low priority job (which requests less than the idle resources) will be Inqueue and the high priority job stays Pending.


tFable commented Mar 7, 2022

Thanks so much for elaborating @ldd91!

In my scenario, if the high priority job requests more resources than are idle in the cluster, the low priority job (which requests less than the idle resources) will be Inqueue and the high priority job stays Pending.

This is also what I'm seeing. To confirm, when you say "idle resource", does that include the resources "created" by the overcommit-factor (1.2 by default)?

That is, if overcommit-factor is truly 1.0, then both the lower and higher priority jobs should be in pending if the cluster is completely utilized, correct?


tFable commented Mar 7, 2022

In my scenario, if the high priority job requests more resources than are idle in the cluster, the low priority job (which requests less than the idle resources) will be Inqueue and the high priority job stays Pending.

Also, even when the low priority job didn't fit in the cluster's idle resources (including the overcommit-factor) and both the low and high priority jobs were Pending, I still observed the smaller, low-priority job starting ahead of the high-priority job. Please note that the low priority job was submitted before the high-priority job.

I wonder if this is because, as the deleted job's pods were being removed, cluster resources became available sooner for the smaller, lower-priority job than for the larger, higher-priority job. This is what @jiangkaihua mentioned here. I'll try his suggestion to move gang lower and test again.

jiangkaihua (Contributor) commented:

@tFable overcommit-factor is 1.2 by default and must be no less than 1.0, so 1.0 should work and block excess jobs from going through the enqueue action and creating pods. I can think of a few possibilities:

  1. The resource request of the Volcano job was not set correctly in the YAML; the job's minResource would then be 0 and the job would pass the enqueue action.
  2. Some other plugin, such as sla or proportion, forced the job through enqueue. You could try putting overcommit in an upper tier and the other plugins in a lower tier, like:
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: overcommit
        arguments:
          overcommit-factor: 1.0
    - plugins:
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack

Hope for your reply. : )


ldd91 commented Mar 7, 2022

In my scenario the overcommit-factor is the default (1.2).


tFable commented Mar 7, 2022

Thanks all!

@jiangkaihua

The resource request of the Volcano job was not set correctly in the YAML; the job's minResource would then be 0 and the job would pass the enqueue action.

I don't believe this is the case, as I can see the minAvailable in job2's podgroup status (and it equals the number of pods in the job).

I tried your suggestion of adding a third tier of plugins. The behavior is the same (I still see pending pods for job2 while job1, which takes up the entire cluster, is running).

The configmap does look a bit different visually with three plugin tiers:

kubectl get configmaps -n volcano-system volcano-scheduler-configmap -o yaml
apiVersion: v1
data:
  volcano-scheduler.conf: "actions: \"enqueue, allocate, backfill\"\ntiers:\n- plugins:\n
    \ - name: priority\n  - name: conformance\n  - name: sla\n- plugins:\n  - name:
    overcommit\n    arguments:\n      \"overcommit-factor\": 1.0 \n- plugins:\n  -
    name: gang\n  - name: drf\n  - name: predicates\n  - name: proportion\n  - name:
    nodeorder\n  - name: binpack\n"
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"volcano-scheduler.conf":"actions: \"enqueue, allocate, backfill\"\ntiers:\n- plugins:\n  - name: priority\n  - name: conformance\n  - name: sla\n- plugins:\n  - name: overcommit\n    arguments:\n      \"overcommit-factor\": 1.0 \n- plugins:\n  - name: gang\n  - name: drf\n  - name: predicates\n  - name: proportion\n  - name: nodeorder\n  - name: binpack\n"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"volcano-scheduler-configmap","namespace":"volcano-system"}}
  creationTimestamp: "2022-03-07T03:49:41Z"
  name: volcano-scheduler-configmap
  namespace: volcano-system
  resourceVersion: "2335692"
  uid: 59625e5e-48bc-4b22-82dd-5bc1440ff558

vs. two plugin tiers:

kubectl get configmaps -n volcano-system volcano-scheduler-configmap -o yaml
apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: conformance
      - name: sla
    - plugins:
      - name: gang
      - name: overcommit
        arguments:
          "overcommit-factor": 1.0
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"volcano-scheduler.conf":"actions: \"enqueue, allocate, backfill\"\ntiers:\n- plugins:\n  - name: priority\n  - name: conformance\n  - name: sla\n- plugins:\n  - name: gang\n  - name: overcommit\n    arguments:\n      \"overcommit-factor\": 1.0\n  - name: drf\n  - name: predicates\n  - name: proportion\n  - name: nodeorder\n  - name: binpack\n"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"volcano-scheduler-configmap","namespace":"volcano-system"}}
  creationTimestamp: "2022-03-07T03:56:43Z"
  name: volcano-scheduler-configmap
  namespace: volcano-system
  resourceVersion: "2337740"
  uid: 27b631d4-d8dd-4088-a0e7-92201dc06727

But perhaps that's just a parsing issue?

Not sure where to go from here and how to figure out why I see pending pods beyond the cluster's full capacity...

adamnovak commented:

@tFable When you set the overcommit factor to 1.0, do you see the same number of pending pods as at 1.2?

I think the overcommit is calculated in terms of total cluster resources, ignoring the fact that they are divided into discrete nodes. So if your pods don't quite fit perfectly onto the nodes, and some resources on each node go unused, Volcano will still create pod(s) to use those unused slivers of resources, as if they could be glommed together and used as a whole. So you can end up with some pending pods even at overcommit 1.0, but you should be seeing fewer of them than you would at 1.2.

At 1.0, if you total up your cluster resources, and if you total up your running and pending pod resources, what overcommit factor are you observing in practice? Is it suspiciously close to 1.2?
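
(One rough way to get those totals, assuming CPU is the binding resource and that the jobs live in the test namespace seen in the scheduler logs:)

# Allocatable CPU per node; summing these gives the cluster total the scheduler sees
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.cpu}{"\n"}{end}'
# CPU requests of running and pending pods in the test namespace; summing these gives the demand side
kubectl get pods -n test -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources.requests.cpu}{"\n"}{end}'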


tFable commented Mar 7, 2022

Hey @adamnovak, thanks so much for your message.

Prompted by your message, I took a more careful testing approach to compare the behavior with overcommit-factor: 1.0 against the default (1.2). There is indeed a difference, which I'll try to explain below.

However, before I provide the long details, I developed a theory while performing the tests:
Is it possible that Volcano is counting my control plane server in the cluster-size calculation, even though nothing is ever scheduled on it because it's tainted? I have 5 nodes plus 1 control plane server, all the same size.

When I say "the entire cluster" below, I'm getting that number from the total number of pods that I see the cluster being able to run.

Is there a way for me to completely exclude a node from volcano's consideration so it's not even considered in the total cluster size calculation?

Also, for context, all vcjobs are 2-task jobs. The first task is always 1 pod and the second task is a variable number of pods. The total number of pods across both tasks always matches the job-level minAvailable. All pods request 1 CPU and 512Mi RAM, with no limits set.

Scenario: No overcommit factor configuration

  • configMap (same for all sub-scenarios of this scenario)
kubectl get configmaps -n volcano-system volcano-scheduler-configmap -o yaml
apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: conformance
      - name: sla
    - plugins:
      - name: overcommit
      - name: gang
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack

Sub-scenario 1

  • job1 already running - taking up entire cluster (35 pods)
  • job2 submitted multiple times, each time increasing the minAvailable. This scenario is true up to and including minAvailable: 13
  • log message (appears only once): overcommit.go:111] Sufficient resources, permit job <test/test-job2> to be inqueue
  • perform kubectl get pods: I see 13 pending pods for job2
  • describe job2 podgroup
Status:
  Conditions:
    Last Transition Time:  2022-03-07T17:47:39Z
    Message:               13/13 tasks in gang unschedulable: pod group is not ready, 13 Pending, 13 minAvailable; Pending: 1 Unschedulable, 12 Undetermined
    Reason:                NotEnoughResources
    Status:                True
    Transition ID:         462a0d17-86e5-46b0-8377-8f51b9f4e18a
    Type:                  Unschedulable
  Phase:                   Inqueue
Events:
  Type     Reason         Age                     From     Message
  ----     ------         ----                    ----     -------
  Warning  Unschedulable  4m20s                   volcano  0/0 tasks in gang unschedulable: pod group is not ready, 13 minAvailable
  Warning  Unschedulable  3m56s (x24 over 4m19s)  volcano  13/13 tasks in gang unschedulable: pod group is not ready, 13 Pending, 13 minAvailable; Pending: 1 Unschedulable, 12 Undetermined

Sub-scenario 2

  • job1 - running taking up entire cluster (35 pods)
  • job2 - Submitted multiple times, increasing minAvailable each time. This behavior is true starting at minAvailable: 14 and up to and including minAvailable: 21
  • log: this keeps repeating: overcommit.go:111] Sufficient resources, permit job <test/test-job2> to be inqueue
  • describe podgroup of job2
Status:
  Conditions:
    Last Transition Time:  2022-03-07T17:52:51Z
    Message:               14/0 tasks in gang unschedulable: pod group is not ready, 14 minAvailable
    Reason:                NotEnoughResources
    Status:                True
    Transition ID:         53b0b3ba-b0c1-4ce4-a807-f98cba3a0a06
    Type:                  Unschedulable
  Phase:                   Pending
Events:
  Type     Reason         Age                 From     Message
  ----     ------         ----                ----     -------
  Warning  Unschedulable  59s (x25 over 83s)  volcano  0/0 tasks in gang unschedulable: pod group is not ready, 14 minAvailable

sub-scenario 3

  • job1 - running taking up entire cluster (35 pods)
  • job2 - submitted multiple times, each time increasing the minAvailable. This behavior is true starting at minAvailable: 22 and all sizes beyond that
  • log: this repeats: overcommit.go:114] Resource in cluster is overused, reject job <test/test-job2> to be inqueue
  • job2 podgroup describe
Status:
  Conditions:
    Last Transition Time:  2022-03-07T18:08:19Z
    Message:               22/0 tasks in gang unschedulable: pod group is not ready, 22 minAvailable
    Reason:                NotEnoughResources
    Status:                True
    Transition ID:         f6476083-b0ba-4b86-b0ba-759e82306e9b
    Type:                  Unschedulable
  Phase:                   Pending
Events:
  Type     Reason         Age                From     Message
  ----     ------         ----               ----     -------
  Warning  Unschedulable  0s (x13 over 12s)  volcano  0/0 tasks in gang unschedulable: pod group is not ready, 22 minAvailable

Scenario: WITH overcommit-factor: 1.0 configuration

  • configmap (the same for all sub-scenarios of this scenario)
Data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: conformance
      - name: sla
    - plugins:
      - name: overcommit
        arguments:
          overcommit-factor: 1.0
      - name: gang
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack 

sub-scenario 1

  • job1 - running and taking up entire cluster (35 pods)
  • job2 - submitted multiple times, each time increasing the minAvailable. This behavior is true up to and including minAvailable: 11
  • log - this shows up once: overcommit.go:111] Sufficient resources, permit job <test-job2> to be inqueue
  • job2 podgroup describe
Status:
  Conditions:
    Last Transition Time:  2022-03-07T18:13:51Z
    Message:               11/11 tasks in gang unschedulable: pod group is not ready, 11 Pending, 11 minAvailable; Pending: 1 Unschedulable, 10 Undetermined
    Reason:                NotEnoughResources
    Status:                True
    Transition ID:         0a543a58-9ef3-4716-ba96-33225ac1d2d4
    Type:                  Unschedulable
  Phase:                   Inqueue
Events:
  Type     Reason         Age                 From     Message
  ----     ------         ----                ----     -------
  Warning  Unschedulable  69s                 volcano  0/0 tasks in gang unschedulable: pod group is not ready, 11 minAvailable
  Warning  Unschedulable  45s (x24 over 68s)  volcano  11/11 tasks in gang unschedulable: pod group is not ready, 11 Pending, 11 minAvailable; Pending: 1 Unschedulable, 10 Undetermined 

sub-scenario 2

  • job1 - running and taking up entire cluster (35 pods)
  • job2 - submitted multiple times, each time increasing the minAvailable. This behavior is true starting at minAvailable: 12 and for all sizes beyond that
  • log - this shows up continuously: overcommit.go:114] Resource in cluster is overused, reject job <test/test-job2> to be inqueue
  • job2 podgroup description
Status:
  Conditions:
    Last Transition Time:  2022-03-07T18:24:50Z
    Message:               12/0 tasks in gang unschedulable: pod group is not ready, 12 minAvailable
    Reason:                NotEnoughResources
    Status:                True
    Transition ID:         387571b4-a0c5-4402-8cf0-8f7b0d763130
    Type:                  Unschedulable
  Phase:                   Pending
Events:
  Type     Reason         Age                      From     Message
  ----     ------         ----                     ----     -------
  Warning  Unschedulable  3m32s (x299 over 8m32s)  volcano  0/0 tasks in gang unschedulable: pod group is not ready, 12 minAvailable


tFable commented Mar 8, 2022

Hey @adamnovak, @ldd91, posting this question in a separate message since the one above is quite long.

Is there a way for me to completely exclude a node from Volcano's consideration so it's not counted in the total cluster-size calculation? I want to do this for our control plane nodes. Even though Volcano doesn't schedule pods on the control plane server, I believe it still counts it toward the total resource calculation.

Thanks in advance!


Thor-wl commented Mar 9, 2022

Hey @adamnovak, @ldd91, posting this question in a separate message since the one above is quite long.

Is there a way for me to completely exclude a node from Volcano's consideration so it's not counted in the total cluster-size calculation? I want to do this for our control plane nodes. Even though Volcano doesn't schedule pods on the control plane server, I believe it still counts it toward the total resource calculation.

Thanks in advance!

@tFable Hey. Of course, Volcano supports considering only a subset of nodes, which is a key feature of v1.5. More details can be found here.
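
(A minimal sketch of that node-filtering approach, under the assumption that the v1.5 feature described in the linked docs works by labeling the nodes Volcano should manage and passing a matching selector to the scheduler; the node names, label key/value, and exact flag name below are illustrative and should be verified against those docs:)

# Label only the nodes Volcano should consider (node names and label are illustrative)
kubectl label nodes worker-1 worker-2 worker-3 worker-4 worker-5 volcano.sh/role=worker
# Then start volcano-scheduler with a matching selector, e.g. an argument of the form
#   --node-selector=volcano.sh/role:worker
# so the tainted control plane node is excluded from the scheduler's capacity calculation.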


tFable commented Mar 9, 2022

Hey @adamnovak, @ldd91, posting this question in a separate message since the one above is quite long.
Is there a way for me to completely exclude a node from Volcano's consideration so it's not counted in the total cluster-size calculation? I want to do this for our control plane nodes. Even though Volcano doesn't schedule pods on the control plane server, I believe it still counts it toward the total resource calculation.
Thanks in advance!

@tFable Hey. Of course, Volcano supports considering only a subset of nodes, which is a key feature of v1.5. More details can be found here.

Hey @Thor-wl, this worked beautifully! Thanks so much!
