Smaller jobs are scheduled ahead of higher priority jobs #2052

Closed · tFable opened this issue Mar 4, 2022 · 20 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

tFable commented Mar 4, 2022

What happened:
For queued jobs in a single queue and a single namespace, jobs requesting fewer resources are being favored ahead of jobs with a higher value in their priorityClassName.

When all queued jobs have the same resource requests, the priorityClassName is honored (jobs with a higher priorityClassName value are executed ahead of jobs with a lower one).

However, when resource requests differ, jobs requesting fewer resources are favored over jobs with a higher priorityClassName value.

I'm relatively new to Volcano, but I presume this means the DRF algorithm is applied before the priorityClassName is evaluated?

Please note that we are not looking for preemption/eviction of running pods (the preempt action is disabled in the configmap, as you'll see below).

What you expected to happen:
Jobs with a higher-value priorityClassName should be scheduled ahead of all other jobs in the same queue and namespace, regardless of each job's resource requests.

How to reproduce it (as minimally and precisely as possible):

  • Create the following two priority classes
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 5000
globalDefault: false
description: "Highest Priority Jobs"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: reg-priority
value: 1000
globalDefault: false
description: "Regular Priority Jobs"
  • For simplicity, deploy a vcjob that takes up the entire cluster (for me, that's 35 pods) and assign it priorityClassName: reg-priority. The job will start running. (The task replicas match the job-level minAvailable.)
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: test-job1
spec:
  priorityClassName: reg-priority
  minAvailable: 35
  schedulerName: volcano
  ...
  • Deploy one or more jobs with small resource requests, still using reg-priority. (The task replicas match the minAvailable.)
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: test-job2
spec:
  priorityClassName: reg-priority
  minAvailable: 12
  schedulerName: volcano
  ...

Now, test-job2 is pending since test-job1 is taking up the entire cluster.

  • Deploy a 3rd job that asks for the entire cluster but, this time, apply a higher priority to the job
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: test-job3
spec:
  priorityClassName: high-priority
  minAvailable: 35
  schedulerName: volcano
  • Delete test-job1
  • Watch test-job2 get executed while test-job3 remains pending

Please note that I have verified this with multiple jobs to ensure that FIFO is not just kicking in. Consistently, the jobs with fewer resources are being scheduled ahead even if other jobs have a higher priority class.
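
For reference, a rough way to observe which job gets picked up after test-job1 is deleted (the vcjob/podgroup resource names and the test namespace are assumptions based on a default Volcano install and the scheduler logs quoted in later comments; adjust as needed):

# Watch job and podgroup phases (Pending / Inqueue / Running)
kubectl get vcjob -n test -w
kubectl get podgroups.scheduling.volcano.sh -n test
# See which job's pods actually land on nodes
kubectl get pods -n test -o wide
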
Anything else we need to know?:
The configmaps look like this:

kubectl get configmaps -n volcano-system volcano-scheduler-configmap -oyaml
apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"volcano-scheduler.conf":"actions: \"enqueue, allocate, backfill\"\ntiers:\n- plugins:\n  - name: priority\n  - name: gang\n  - name: conformance\n- plugins:\n  - name: overcommit\n  - name: drf\n  - name: predicates\n  - name: proportion\n  - name: nodeorder\n  - name: binpack\n"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"volcano-scheduler-configmap","namespace":"volcano-system"}}
  creationTimestamp: "2022-03-04T12:00:25Z"
  name: volcano-scheduler-configmap
  namespace: volcano-system
  resourceVersion: "1715819"

and

kubectl get configmaps -n volcano-system volcano-admission-configmap -oyaml 
apiVersion: v1
data:
  volcano-admission.conf: |
    #resourceGroups:
    #- resourceGroup: management                    # set the resource group name
    #  object:
    #    key: namespace                             # set the field and the value to be matched
    #    value:
    #    - mng-ns-1
    #  schedulerName: default-scheduler             # set the scheduler for patching
    #  tolerations:                                 # set the tolerations for patching
    #  - effect: NoSchedule
    #    key: taint
    #    operator: Exists
    #  labels:
    #    volcano.sh/nodetype: management           # set the nodeSelector for patching
    #- resourceGroup: cpu
    #  object:
    #    key: annotation
    #    value:
    #    - "volcano.sh/resource-group: cpu"
    #  schedulerName: volcano
    #  labels:
    #    volcano.sh/nodetype: cpu
    #- resourceGroup: gpu                          # if the object is unsetted, default is:  the key is annotation,
    #  schedulerName: volcano                      # the annotation key is fixed and is "volcano.sh/resource-group", The corresponding value is the resourceGroup field
    #  labels:
    #    volcano.sh/nodetype: gpu
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"volcano-admission.conf":"#resourceGroups:\n#- resourceGroup: management                    # set the resource group name\n#  object:\n#    key: namespace                             # set the field and the value to be matched\n#    value:\n#    - mng-ns-1\n#  schedulerName: default-scheduler             # set the scheduler for patching\n#  tolerations:                                 # set the tolerations for patching\n#  - effect: NoSchedule\n#    key: taint\n#    operator: Exists\n#  labels:\n#    volcano.sh/nodetype: management           # set the nodeSelector for patching\n#- resourceGroup: cpu\n#  object:\n#    key: annotation\n#    value:\n#    - \"volcano.sh/resource-group: cpu\"\n#  schedulerName: volcano\n#  labels:\n#    volcano.sh/nodetype: cpu\n#- resourceGroup: gpu                          # if the object is unsetted, default is:  the key is annotation,\n#  schedulerName: volcano                      # the annotation key is fixed and is \"volcano.sh/resource-group\", The corresponding value is the resourceGroup field\n#  labels:\n#    volcano.sh/nodetype: gpu\n"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"volcano-admission-configmap","namespace":"volcano-system"}}
  creationTimestamp: "2022-03-04T12:00:24Z"
  name: volcano-admission-configmap
  namespace: volcano-system
  resourceVersion: "1715769"
  uid: 6b364011-0d92-452d-b6c6-93242ac67cbb

queue config

kubectl -n volcano-system get queues.scheduling.volcano.sh q1 -oyaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"scheduling.volcano.sh/v1beta1","kind":"Queue","metadata":{"annotations":{},"name":"q1"},"spec":{"reclaimable":true,"weight":1}}
  creationTimestamp: "2022-03-04T12:00:33Z"
  generation: 1
  name: q1
  resourceVersion: "1721738"
  uid: c988a11f-8b81-4b5f-addf-290b699dda22
spec:
  reclaimable: true
  weight: 1
status:
  pending: 3
  reservation: {}
  running: 1
  state: Open

Environment:

  • Volcano Version: 1.5
  • Kubernetes version (use kubectl version): 1.23
  • Cloud provider or hardware configuration: vSphere on-prem
  • OS (e.g. from /etc/os-release):
NAME="Ubuntu"
VERSION="20.04.3 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.3 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
  • Kernel (e.g. uname -a): 5.13.0-28-generic #31~20.04.1-Ubuntu SMP Wed Jan 19 14:08:10 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Others:
tFable added the kind/bug label on Mar 4, 2022

tFable commented Mar 4, 2022

It's worth noting that the individual tasks don't have a priorityClassName defined.

Here is the entire job template with the tasks:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: test-job4
spec:
  priorityClassName: reg-priority
  minAvailable: 9
  schedulerName: volcano
  maxRetry: 5
  queue: q1
  tasks:
    - replicas: 1
      name: "test-head"
      template:
        metadata:
          labels:
            component: test-job4-ray-head
            type: ray
        spec:
          volumes:
            - name: dshm
              emptyDir:
                medium: Memory
            - name: envy
              hostPath:
                path: /envy
            - name: q
              hostPath:
                path: /q
            - name: auto
              hostPath:
                path: /auto
          containers:
            - name: ray-head
              image: ray/ray
              imagePullPolicy: IfNotPresent
              command: [ "/bin/bash", "-c", "--" ]
              args:
                - "ray start --head --port=6379 --redis-shard-ports=6380,6381 --num-cpus=$MY_CPU_REQUEST --object-manager-port=22345 --node-manager-port=22346 --dashboard-host=0.0.0.0 --block"
              ports:
                - containerPort: 6379 # Redis port
                - containerPort: 10001 # Used by Ray Client
                - containerPort: 8265 # Used by Ray Dashboard
              volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
                - mountPath: /envy
                  mountPropagation: HostToContainer
                  name: envy
                - mountPath: /q
                  mountPropagation: HostToContainer
                  name: q
                - mountPath: /auto
                  mountPropagation: HostToContainer
                  name: auto
              env:
                - name: MY_CPU_REQUEST
                  valueFrom:
                    resourceFieldRef:
                      resource: requests.cpu
              resources:
                requests:
                  cpu: 1
                  memory: 512Mi
    - replicas: 8
      name: "test-workers"
      dependsOn:
        name:
          - "test-head"
      template:
        metadata:
          name: web
        spec:
          restartPolicy: Never
          volumes:
          - name: dshm
            emptyDir:
              medium: Memory
          - name: envy
            hostPath:
              path: /envy
          - name: q
            hostPath:
              path: /q
          - name: auto
            hostPath:
              path: /auto
          containers:
           - name: ray-head
             image: ray/ray
             imagePullPolicy: Always
             command: [ "/bin/bash", "-c", "--" ]
             args:
               - "ray start --num-cpus=$MY_CPU_REQUEST --address=$TEST_JOB4_RAY_HEAD_SERVICE_HOST:$TEST_JOB4_RAY_HEAD_SERVICE_PORT_REDIS --object-manager-port=22345 --node-manager-port=22346 --block"
             ports:
               - containerPort: 6379 # Redis port
               - containerPort: 10001 # Used by Ray Client
               - containerPort: 8265 # Used by Ray Dashboard
             volumeMounts:
               - mountPath: /dev/shm
                 name: dshm
               - mountPath: /envy
                 mountPropagation: HostToContainer
                 name: envy
               - mountPath: /q
                 mountPropagation: HostToContainer
                 name: q
               - mountPath: /auto
                 mountPropagation: HostToContainer
                 name: auto

             env:
               - name: MY_CPU_REQUEST
                 valueFrom:
                   resourceFieldRef:
                     resource: requests.cpu
             resources:
               requests:
                 cpu: 1
                 memory: 512Mi
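
(If a per-task priority were ever desired, priorityClassName is a standard Kubernetes pod-spec field, so it could also be set inside a task's pod template. A minimal sketch only, reusing names from the template above; whether the Volcano scheduler honors per-task pod priorities is a separate question:)

  tasks:
    - replicas: 8
      name: "test-workers"
      template:
        spec:
          # standard pod-spec field; applies only to this task's pods
          priorityClassName: high-priority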


ldd91 commented Mar 4, 2022

You can add the preempt action and give it a try.


tFable commented Mar 4, 2022

Hi @ldd91,
Thanks for the message.
I'm specifically looking for already-running jobs/pods to not be evicted/preempted (our software doesn't support it).

Will adding the preempt action evict lower-prio jobs (or some of their pods) as soon as higher priority jobs get scheduled?

For context: The job-level minAvailable and the sum of all task replicas are equal.

Thanks again!
t


tFable commented Mar 4, 2022

FWIW, I did add the preempt action and the behavior remained the same.

Here's the full configmap (I re-installed Volcano so that the change would be picked up):

kubectl get configmaps -n volcano-system volcano-scheduler-configmap -o yaml
apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill, preempt"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"volcano-scheduler.conf":"actions: \"enqueue, allocate, backfill, preempt\"\ntiers:\n- plugins:\n  - name: priority\n  - name: gang\n  - name: conformance\n- plugins:\n  - name: overcommit\n  - name: drf\n  - name: predicates\n  - name: proportion\n  - name: nodeorder\n  - name: binpack\n"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"volcano-scheduler-configmap","namespace":"volcano-system"}}
  creationTimestamp: "2022-03-04T17:57:43Z"
  name: volcano-scheduler-configmap
  namespace: volcano-system
  resourceVersion: "1764191"
  uid: f086f7b9-af66-4e02-968a-da235f9829f6
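
(A lighter-weight alternative to re-installing, assuming the default deployment name from the Volcano install, is to edit the configmap and restart the scheduler so it reloads the configuration:)

# Edit the scheduler config in place, then restart the scheduler deployment
# ("volcano-scheduler" is assumed to be the default deployment name)
kubectl -n volcano-system edit configmap volcano-scheduler-configmap
kubectl -n volcano-system rollout restart deployment volcano-scheduler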


tFable commented Mar 5, 2022

This seems to happen even when the drf plugin is not in the list of plugins. No matter what I've tried, I'm unable to make Volcano stop preferring smaller jobs over larger ones; smaller jobs always go in first.

I'm perplexed and any help would be greatly appreciated. This is currently a blocker for us to get Volcano out of PoC.


tFable commented Mar 5, 2022

The plot thickens...

I now believe my issue is related to what the folks in #1901 are seeing.

I think the root cause is the overcommit-factor of 1.2. This is what I believe happens:

  • job1, which takes up the entire cluster, starts running (priority: regular)
  • job2, which asks for the entire cluster, is queued (priority: regular)
  • job3, which asks for a small number of pods, is queued (priority: regular)
  • job4, which asks for the entire cluster, is queued (priority: HIGH)

When job1 starts running, I also see pending pods from job3. I also see job3 in Inqueue status, whereas jobs 2 and 4 stay Pending while job1 is running.

If I rerun the scenario with job1 (entire cluster) but change job3 to a large number of pods (though not the entire cluster), then jobs 2, 3, and 4 all stay Pending and there are no pending pods (only job1's pods show up in Running state).

So what I believe happens when job3 has a small number of pods is that its podgroup is small enough to fit within the default overcommit-factor of 1.2, and thus its pods are created in Pending state while job1 is running. Because those pods already exist in Pending state, when job1 is removed, job3's pods, and NOT job4's pods (which have higher priority), start running.

I tried changing the overcommit-factor to 1.0 following the example here; however, this doesn't change the behavior.


tFable commented Mar 7, 2022

I am not able to completely disable overcommit. Even when I set

- plugins:
  - name: overcommit
    arguments:
      "overcommit-factor": 1.0

Even when job1 is designed to take up the entire cluster, if job2 is small enough it will still go Inqueue and have pending pods. It seems to me that the overcommit factor never actually drops to 1 (i.e., no overcommit).

That said, I'm seeing the behavior (small low-prio job runs ahead of large high-prio job) even when job2 is NOT small enough to fit in the overcommit.

The only way I've been able to get around this is to add an sla annotation to the higher-priority job:

annotations:
    sla-waiting-time: 2h
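
(In context, the annotation sits on the vcjob metadata, assuming the sla plugin picks it up via the job's generated PodGroup, as the snippet above suggests; a sketch using the earlier example job:)

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: test-job3
  annotations:
    sla-waiting-time: 2h   # per-job SLA; requires the sla plugin to be enabled in the scheduler tiers
spec:
  priorityClassName: high-priority
  minAvailable: 35
  schedulerName: volcano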


ldd91 commented Mar 7, 2022

Sorry, I didn't see that your scenario doesn't support preemption/eviction. I'll try to reproduce it.


ldd91 commented Mar 7, 2022

Here is my configmap:

kubectl get configmap volcano-scheduler-configmap  -n volcano-system  -o yaml
apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, preempt, reclaim, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"volcano-scheduler.conf":"actions: \"enqueue, allocate, backfill\"\ntiers:\n- plugins:\n  - name: priority\n  - name: gang\n  - name: conformance\n- plugins:\n  - name: overcommit\n  - name: drf\n  - name: predicates\n  - name: proportion\n  - name: nodeorder\n  - name: binpack\n"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"volcano-scheduler-configmap","namespace":"volcano-system"}}
  creationTimestamp: "2022-02-21T14:34:47Z"
  name: volcano-scheduler-configmap
  namespace: volcano-system
  resourceVersion: "11819858"
  uid: c5a14834-298e-47f1-b495-c36ec67eb33c

cat low-priority.yaml
apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: low-pri
value: -1
cat high-priority.yaml
apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: high-pri
preemptionPolicy: PreemptLowerPriority
value: 1000

In my scenario, the high priority job is scheduled ahead of the low priority job (the low priority job requests fewer resources and was submitted first).


ldd91 commented Mar 7, 2022

In my scenario, if the high priority job requests more resources than are idle in the cluster, the low priority job (which requests less than the idle resources) will be Inqueue and the high priority job stays Pending.


tFable commented Mar 7, 2022

Thanks so much for elaborating @ldd91!

In my scenario, if the high priority job requests more resources than are idle in the cluster, the low priority job (which requests less than the idle resources) will be Inqueue and the high priority job stays Pending.

This is also what I'm seeing. To confirm, when you say "idle resource", does that include the resources "created" by the overcommit-factor (1.2 by default)?

That is, if overcommit-factor is truly 1.0, then both the lower and higher priority jobs should be in pending if the cluster is completely utilized, correct?


tFable commented Mar 7, 2022

In my scenario, if the high priority job requests more resources than are idle in the cluster, the low priority job (which requests less than the idle resources) will be Inqueue and the high priority job stays Pending.

Also, even when the low priority job didn't fit in the cluster's idle resources (including the overcommit-factor) and both the low and high priority jobs were Pending, I still observed the smaller, low-priority job starting ahead of the high-priority job. Please note that the low priority job was submitted before the high-priority job.

I wonder if this is because, as the deleted job's pods were being removed, cluster resources became available sooner for the smaller, lower-priority job than for the larger, higher-priority job. This is what @jiangkaihua mentioned here. I'll try his suggestion to move gang lower and test again.

jiangkaihua (Contributor) commented:

@tFable overcommit-factor is 1.2 by default and must be no less than 1.0, so 1.0 should work and block excess jobs from going through the enqueue action and creating pods. I can think of a few possibilities:

  1. The resource request of the Volcano job was not set correctly in the YAML; the job's minResource would then be 0 and the job would pass the enqueue action.
  2. Some other plugin, such as sla or proportion, forced the job through enqueue. You could try putting overcommit in an upper tier and the other plugins in a lower tier, like:
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: overcommit
        arguments:
          overcommit-factor: 1.0
    - plugins:
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack

Hope for your reply. : )


ldd91 commented Mar 7, 2022

In my scenario the overcommit-factor is the default (1.2).


tFable commented Mar 7, 2022

Thanks all!

@jiangkaihua

The resource request of the Volcano job was not set correctly in the YAML; the job's minResource would then be 0 and the job would pass the enqueue action.

I don't believe this is the case, as I can see the minAvailable in job2's podgroup status (and it equals the number of pods in the job).

I tried your suggestion of adding a third tier of plugins. The behavior is the same (I still see pending pods for job2 while job1, which takes up the entire cluster, is running).

The configmap does look a bit different visually with three plugin tiers:

kubectl get configmaps -n volcano-system volcano-scheduler-configmap -o yaml
apiVersion: v1
data:
  volcano-scheduler.conf: "actions: \"enqueue, allocate, backfill\"\ntiers:\n- plugins:\n
    \ - name: priority\n  - name: conformance\n  - name: sla\n- plugins:\n  - name:
    overcommit\n    arguments:\n      \"overcommit-factor\": 1.0 \n- plugins:\n  -
    name: gang\n  - name: drf\n  - name: predicates\n  - name: proportion\n  - name:
    nodeorder\n  - name: binpack\n"
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"volcano-scheduler.conf":"actions: \"enqueue, allocate, backfill\"\ntiers:\n- plugins:\n  - name: priority\n  - name: conformance\n  - name: sla\n- plugins:\n  - name: overcommit\n    arguments:\n      \"overcommit-factor\": 1.0 \n- plugins:\n  - name: gang\n  - name: drf\n  - name: predicates\n  - name: proportion\n  - name: nodeorder\n  - name: binpack\n"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"volcano-scheduler-configmap","namespace":"volcano-system"}}
  creationTimestamp: "2022-03-07T03:49:41Z"
  name: volcano-scheduler-configmap
  namespace: volcano-system
  resourceVersion: "2335692"
  uid: 59625e5e-48bc-4b22-82dd-5bc1440ff558

vs. two plugin tiers:

kubectl get configmaps -n volcano-system volcano-scheduler-configmap -o yaml
apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: conformance
      - name: sla
    - plugins:
      - name: gang
      - name: overcommit
        arguments:
          "overcommit-factor": 1.0
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"volcano-scheduler.conf":"actions: \"enqueue, allocate, backfill\"\ntiers:\n- plugins:\n  - name: priority\n  - name: conformance\n  - name: sla\n- plugins:\n  - name: gang\n  - name: overcommit\n    arguments:\n      \"overcommit-factor\": 1.0\n  - name: drf\n  - name: predicates\n  - name: proportion\n  - name: nodeorder\n  - name: binpack\n"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"volcano-scheduler-configmap","namespace":"volcano-system"}}
  creationTimestamp: "2022-03-07T03:56:43Z"
  name: volcano-scheduler-configmap
  namespace: volcano-system
  resourceVersion: "2337740"
  uid: 27b631d4-d8dd-4088-a0e7-92201dc06727

But perhaps that's just a parsing issue?

Not sure where to go from here and how to figure out why I see pending pods beyond the cluster's full capacity...

adamnovak commented:

@tFable When you set the overcommit factor to 1.0, do you see the same number of pending pods as at 1.2?

I think the overcommit is calculated in terms of total cluster resources, ignoring the fact that they are divided into discrete nodes. So if your pods don't quite fit perfectly onto the nodes, and some resources on each node go unused, Volcano will still create pod(s) to use those unused slivers of resources, as if they could be glommed together and used as a whole. So you can end up with some pending pods even at overcommit 1.0, but you should be seeing fewer of them than you would at 1.2.

At 1.0, if you total up your cluster resources, and if you total up your running and pending pod resources, what overcommit factor are you observing in practice? Is it suspiciously close to 1.2?
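
(One rough way to get those totals, assuming CPU is the binding resource and that the jobs live in the test namespace seen in the scheduler logs:)

# Allocatable CPU per node; summing these gives the cluster total the scheduler sees
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.cpu}{"\n"}{end}'
# CPU requests of running and pending pods in the test namespace; summing these gives the demand side
kubectl get pods -n test -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources.requests.cpu}{"\n"}{end}'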


tFable commented Mar 7, 2022

Hey @adamnovak, thanks so much for your message.

Prompted by your message, I took a more careful testing approach to compare the behavior with overcommit-factor: 1.0 against the default (1.2). There is indeed a difference, which I'll try to explain below.

However, before I provide the long details, I developed a theory while performing the tests:
Is it possible that Volcano is counting my control plane server in the cluster-size calculation, even though nothing is ever scheduled on it because it's tainted? I have 5 nodes plus 1 control plane server, all the same size.

When I say "the entire cluster" below, I'm getting that number from the total number of pods that I see the cluster being able to run.

Is there a way for me to completely exclude a node from volcano's consideration so it's not even considered in the total cluster size calculation?

Also, for context, all vcjobs are 2-task jobs. The first task is always 1 pod and the second task is a variable number of pods. The total number of pods across both tasks always matches the job-level minAvailable. All pods request 1 CPU and 512Mi RAM, with no limits set.

Scenario: No overcommit factor configuration

  • configMap (same for all sub-scenarios of this scenario)
kubectl get configmaps -n volcano-system volcano-scheduler-configmap -o yaml
apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: conformance
      - name: sla
    - plugins:
      - name: overcommit
      - name: gang
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack

Sub-scenario 1

  • job1 already running - taking up entire cluster (35 pods)
  • job2 submitted multiple times, each time increasing the minAvailable. This scenario is true up to and including minAvailable: 13
  • log message (appears only once): overcommit.go:111] Sufficient resources, permit job <test/test-job2> to be inqueue
  • perform kubectl get pods: I see 13 pending pods for job2
  • describe job2 podgroup
Status:
  Conditions:
    Last Transition Time:  2022-03-07T17:47:39Z
    Message:               13/13 tasks in gang unschedulable: pod group is not ready, 13 Pending, 13 minAvailable; Pending: 1 Unschedulable, 12 Undetermined
    Reason:                NotEnoughResources
    Status:                True
    Transition ID:         462a0d17-86e5-46b0-8377-8f51b9f4e18a
    Type:                  Unschedulable
  Phase:                   Inqueue
Events:
  Type     Reason         Age                     From     Message
  ----     ------         ----                    ----     -------
  Warning  Unschedulable  4m20s                   volcano  0/0 tasks in gang unschedulable: pod group is not ready, 13 minAvailable
  Warning  Unschedulable  3m56s (x24 over 4m19s)  volcano  13/13 tasks in gang unschedulable: pod group is not ready, 13 Pending, 13 minAvailable; Pending: 1 Unschedulable, 12 Undetermined

Sub-scenario 2

  • job1 - running taking up entire cluster (35 pods)
  • job2 - Submitted multiple times, increasing minAvailable each time. This behavior is true starting at minAvailable: 14 and up to and including minAvailable: 21
  • log: this keeps repeating: overcommit.go:111] Sufficient resources, permit job <test/test-job2> to be inqueue
  • describe podgroup of job2
Status:
  Conditions:
    Last Transition Time:  2022-03-07T17:52:51Z
    Message:               14/0 tasks in gang unschedulable: pod group is not ready, 14 minAvailable
    Reason:                NotEnoughResources
    Status:                True
    Transition ID:         53b0b3ba-b0c1-4ce4-a807-f98cba3a0a06
    Type:                  Unschedulable
  Phase:                   Pending
Events:
  Type     Reason         Age                 From     Message
  ----     ------         ----                ----     -------
  Warning  Unschedulable  59s (x25 over 83s)  volcano  0/0 tasks in gang unschedulable: pod group is not ready, 14 minAvailable

sub-scenario 3

  • job1 - running taking up entire cluster (35 pods)
  • job2 - submitted multiple times, each time increasing the minAvailable. This behavior is true starting at minAvailable: 22 and all sizes beyond that
  • log: this repeats: overcommit.go:114] Resource in cluster is overused, reject job <test/test-job2> to be inqueue
  • job2 podgroup describe
Status:
  Conditions:
    Last Transition Time:  2022-03-07T18:08:19Z
    Message:               22/0 tasks in gang unschedulable: pod group is not ready, 22 minAvailable
    Reason:                NotEnoughResources
    Status:                True
    Transition ID:         f6476083-b0ba-4b86-b0ba-759e82306e9b
    Type:                  Unschedulable
  Phase:                   Pending
Events:
  Type     Reason         Age                From     Message
  ----     ------         ----               ----     -------
  Warning  Unschedulable  0s (x13 over 12s)  volcano  0/0 tasks in gang unschedulable: pod group is not ready, 22 minAvailable

Scenario: WITH overcommit-factor: 1.0 configuration

  • configmap (the same for all sub-scenarios of this scenario)
Data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: conformance
      - name: sla
    - plugins:
      - name: overcommit
        arguments:
          overcommit-factor: 1.0
      - name: gang
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack 

sub-scenario 1

  • job1 - running and taking up entire cluster (35 pods)
  • job2 - submitted multiple times, each time increasing the minAvailable. This behavior is true up to and including minAvailable: 11
  • log - this shows up once: overcommit.go:111] Sufficient resources, permit job <test-job2> to be inqueue
  • job2 podgroup describe
Status:
  Conditions:
    Last Transition Time:  2022-03-07T18:13:51Z
    Message:               11/11 tasks in gang unschedulable: pod group is not ready, 11 Pending, 11 minAvailable; Pending: 1 Unschedulable, 10 Undetermined
    Reason:                NotEnoughResources
    Status:                True
    Transition ID:         0a543a58-9ef3-4716-ba96-33225ac1d2d4
    Type:                  Unschedulable
  Phase:                   Inqueue
Events:
  Type     Reason         Age                 From     Message
  ----     ------         ----                ----     -------
  Warning  Unschedulable  69s                 volcano  0/0 tasks in gang unschedulable: pod group is not ready, 11 minAvailable
  Warning  Unschedulable  45s (x24 over 68s)  volcano  11/11 tasks in gang unschedulable: pod group is not ready, 11 Pending, 11 minAvailable; Pending: 1 Unschedulable, 10 Undetermined 

sub-scenario 2

  • job1 - running and taking up entire cluster (35 pods)
  • job2 - submitted multiple times, each time increasing the minAvailable. This behavior is true starting at minAvailable: 12 and for all sizes beyond that
  • log - this shows up continuously: overcommit.go:114] Resource in cluster is overused, reject job <test/test-job2> to be inqueue
  • job2 podgroup description
Status:
  Conditions:
    Last Transition Time:  2022-03-07T18:24:50Z
    Message:               12/0 tasks in gang unschedulable: pod group is not ready, 12 minAvailable
    Reason:                NotEnoughResources
    Status:                True
    Transition ID:         387571b4-a0c5-4402-8cf0-8f7b0d763130
    Type:                  Unschedulable
  Phase:                   Pending
Events:
  Type     Reason         Age                      From     Message
  ----     ------         ----                     ----     -------
  Warning  Unschedulable  3m32s (x299 over 8m32s)  volcano  0/0 tasks in gang unschedulable: pod group is not ready, 12 minAvailable


tFable commented Mar 8, 2022

Hey @adamnovak, @ldd91, posting this question in a separate message since the one above is quite long.

Is there a way for me to completely exclude a node from Volcano's consideration so it's not counted in the total cluster-size calculation? I want to do this for our control plane nodes. Even though Volcano doesn't schedule pods on the control plane server, I believe it still counts it toward the total resource calculation.

Thanks in advance!


Thor-wl commented Mar 9, 2022

Hey @adamnovak, @ldd91, posting this question in a separate message since the one above is quite long.

Is there a way for me to completely exclude a node from Volcano's consideration so it's not counted in the total cluster-size calculation? I want to do this for our control plane nodes. Even though Volcano doesn't schedule pods on the control plane server, I believe it still counts it toward the total resource calculation.

Thanks in advance!

@tFable Hey. Of course, Volcano supports considering only a subset of nodes, which is a key feature of v1.5. More details can be found here.
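
(A minimal sketch of that node-filtering approach, under the assumption that the v1.5 feature described in the linked docs works by labeling the nodes Volcano should manage and passing a matching selector to the scheduler; the node names, label key/value, and exact flag name below are illustrative and should be verified against those docs:)

# Label only the nodes Volcano should consider (node names and label are illustrative)
kubectl label nodes worker-1 worker-2 worker-3 worker-4 worker-5 volcano.sh/role=worker
# Then start volcano-scheduler with a matching selector, e.g. an argument of the form
#   --node-selector=volcano.sh/role:worker
# so the tainted control plane node is excluded from the scheduler's capacity calculation.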


tFable commented Mar 9, 2022

Hey @adamnovak, @ldd91, posting this question in a separate message since the one above is quite long.
Is there a way for me to completely exclude a node from Volcano's consideration so it's not counted in the total cluster-size calculation? I want to do this for our control plane nodes. Even though Volcano doesn't schedule pods on the control plane server, I believe it still counts it toward the total resource calculation.
Thanks in advance!

@tFable Hey. Of course, Volcano supports considering only a subset of nodes, which is a key feature of v1.5. More details can be found here.

Hey @Thor-wl, this worked beautifully! Thanks so much!
