Smaller jobs are scheduled ahead of higher priority jobs #2052
Comments
It's worth noting that the individual tasks don't have a […]. Here is the entire job template with the tasks:
You can add the preempt action and have a try.
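For reference, that means appending preempt to the scheduler's action list in the volcano-scheduler ConfigMap. A minimal sketch, assuming the default action list and tier layout (the exact placement of preempt and the rest of the file may differ in your setup):

actions: "enqueue, allocate, preempt, backfill"   # "preempt" added; placement shown is one common choice
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: conformance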
Hi @ldd91, will adding the preempt action evict lower-priority jobs (or some of their pods) as soon as higher-priority jobs get scheduled? For context, the job-level […]. Thanks again!
FWIW, I did add the preempt action. Here's the full config map (I re-installed Volcano to apply this so the change is picked up):
This seems to be happening even with the preempt action enabled. I'm perplexed and any help would be greatly appreciated. This is currently a blocker for us to get Volcano out of PoC.
The plot thickens... I now believe my issue is related to what the folks in #1901 are seeing. I think the root cause is the overcommit plugin.
When job1 starts running, I also see job3 go Inqueue. If I rerun the scenario and run job1 (entire cluster) but change job3 to a large number of pods (though not the entire cluster), then jobs 2, 3, and 4 all stay in Pending. So what I believe is happening when job3 has a small number of pods is that its podgroup is small enough to fit within the default overcommit-factor headroom. I tried changing the overcommit-factor as well.
I am not able to completely disable overcommit, even when I set overcommit-factor to 1.0.
And when job1 is designed to take up the entire cluster, if job2 is small enough, it will also go Inqueue. That said, I'm seeing the behavior (small low-priority job runs ahead of large high-priority job) even when job2 is NOT small enough to fit in the overcommit headroom. The only way I've been able to get around this is to add an sla to the higher-priority job, as sketched below.
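A sketch of the sla approach just mentioned, assuming the sla plugin's sla-waiting-time argument and a per-job annotation of the same name (the annotation key should be verified against your Volcano version; the durations and job name are only examples):

# scheduler ConfigMap: enable the sla plugin (sketch)
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: conformance
  - name: sla
    arguments:
      sla-waiting-time: 1h          # example cluster-wide default waiting time
---
# per-job override via annotation (key assumed from the sla plugin docs)
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: high-prio-job               # hypothetical job name
  annotations:
    sla-waiting-time: 30m           # example value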
Sorry, I didn't see that your scenario doesn't support preemption/eviction. I'll try to reproduce it.
Here is my configmap:
In my scenario, the high-priority job can be scheduled ahead of the low-priority job (the low-priority job has a smaller resource request and was submitted first).
In my scenario, if the high-priority job's resource request is greater than the idle resources in the cluster, the low-priority job (whose request is less than the idle resources) will be Inqueue and the high-priority job stays Pending.
Thanks so much for elaborating @ldd91!
This is also what I'm seeing. To confirm, when you say "idle resource", does that include the resources "created" by the overcommit-factor? That is, if overcommit-factor is truly 1.0, does the scheduler consider only the resources that are actually idle?
Also, even if the low-priority job doesn't fit in the cluster's idle resources (including the overcommit-factor) and both low- and high-priority jobs were pending, the smaller job still ended up running first. I wonder if this is due to the fact that, as pods from the deleted jobs were being deleted as well, the cluster resources became available more quickly for the smaller, lower-priority job than for the higher-priority but larger job. This is what @jiangkaihua mentioned here. I'll be trying his suggestion to move the plugins into separate tiers.
@tFable
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: conformance
- plugins:
  - name: overcommit
    arguments:
      overcommit-factor: 1.0
- plugins:
  - name: drf
  - name: predicates
  - name: proportion
  - name: nodeorder
  - name: binpack

Hope for your reply. : )
In my scenario, overcommit-factor is the default (1.2).
Thanks all!
I don't believe this is the case, as I can see the minAvailable (which is equal to the number of pods in the job) in the podgroup status of job2. I tried your suggestion of adding a 3rd tier of plugins. The behavior is the same (I still see Pending pods for job2 when job1, which takes the entire cluster, is running). The configmap does look a bit different visually with 3 plugin tiers
vs 2 plugin tiers
But perhaps that's just a parsing issue? Not sure where to go from here or how to figure out why I see those Pending pods.
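For comparison, the 2-tier layout being referred to resembles the stock Volcano config shape, with overcommit placed in the second tier next to the allocation plugins. This is only a sketch with overcommit-factor: 1.0 filled in, not the exact configmap in use:

actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: conformance
- plugins:
  - name: overcommit
    arguments:
      overcommit-factor: 1.0
  - name: drf
  - name: predicates
  - name: proportion
  - name: nodeorder
  - name: binpack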
@tFable When you set the overcommit factor to 1.0, do you see the same number of pending pods as at 1.2?

I think the overcommit is calculated in terms of total cluster resources, ignoring the fact that they are divided into discrete nodes. So if your pods don't quite fit perfectly onto the nodes, and some resources on each node go unused, Volcano will still create pod(s) to use those unused slivers of resources, as if they could be glommed together and used as a whole. So you can end up with some pending pods even at overcommit 1.0, but you should be seeing fewer of them than you would at 1.2.

At 1.0, if you total up your cluster resources, and if you total up your running and pending pod resources, what overcommit factor are you observing in practice? Is it suspiciously close to 1.2?
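To make that hypothesis concrete, here is a small worked example with made-up numbers (none of these figures come from the cluster in question):

# Hypothetical cluster: 3 worker nodes with 8 CPUs each
cluster_cpu_total: 24
overcommit_factor: 1.2
enqueue_budget_cpu: 28.8      # 24 * 1.2 - the aggregate the enqueue step works with
pod_cpu_request: 5
pods_that_fit_on_nodes: 3     # one 5-CPU pod per 8-CPU node; 3 CPUs stranded per node
pods_allowed_by_budget: 5     # floor(28.8 / 5) - stranded slivers counted as if usable
# At factor 1.2: roughly 2 extra pods get admitted but sit Pending.
# At factor 1.0: floor(24 / 5) = 4 pods are admitted, still 1 more than can actually run.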
Hey @adamnovak, thanks so much for your message. As a result of it, I decided to take a more careful testing approach to see the behavior difference when I have an overcommit-factor of 1.0 configured vs. not. However, before I provide the long details, I developed a theory while performing the tests.

When I say "the entire cluster" below, I'm getting that number from the total number of pods that I see the cluster being able to run. Is there a way for me to completely exclude a node from Volcano's consideration so it's not even counted in the total cluster size calculation?

Also, for context, all vcjobs are 2-task jobs. The first task is always 1 pod and the second task is a variable quantity of pods. The total number of pods in both tasks always matches the job-level minAvailable (see the sketch after the sub-scenarios below).

Scenario: No overcommit-factor configuration
Sub-scenario 1
Sub-scenario 2
Sub-scenario 3
Scenario: WITH overcommit-factor: 1.0 configuration
Sub-scenario 1
Sub-scenario 2
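For reference, the 2-task job shape described above (1 pod in the first task, N pods in the second, job-level minAvailable equal to the total) would look roughly like the following sketch; the names, replica counts, and image are placeholders rather than the actual jobs used:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: example-vcjob               # placeholder name
spec:
  schedulerName: volcano
  queue: default
  priorityClassName: reg-priority
  minAvailable: 5                   # 1 (first task) + 4 (second task) = all pods
  tasks:
  - name: leader
    replicas: 1
    template:
      spec:
        restartPolicy: OnFailure
        containers:
        - name: main
          image: busybox            # placeholder image
          command: ["sleep", "3600"]
  - name: workers
    replicas: 4
    template:
      spec:
        restartPolicy: OnFailure
        containers:
        - name: main
          image: busybox
          command: ["sleep", "3600"]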
Hey @adamnovak, @ldd91, posting this question in a separate message since the one above is quite long. Is there a way for me to completely exclude a node from Volcano's consideration so it's not even counted in the total cluster size calculation? I want to do this for our control plane nodes. Even though Volcano doesn't schedule pods on the control plane server, I believe it still considers it in the total resource calculation. Thanks in advance!
@tFable Hey. Of course, Volcano supports considering only a subset of nodes, which is a key feature of v1.5. More details can be found here.
Hey @Thor-wl, this worked beautifully! Thanks so much!
What happened:
For queued jobs in a single queue and in a single namespace, jobs requesting fewer resources are being favored ahead of jobs with a higher value in their priorityClassName.

When all queued jobs have the same resource requests, the priorityClassName is being honored (jobs with a higher value in priorityClassName are executed ahead of jobs with a lower one). However, jobs requesting fewer resources are being favored ahead of jobs with a higher value in their priorityClassName.

From my relatively new knowledge of Volcano, I presume that would mean the DRF algorithm is being applied before the priorityClassName is evaluated, maybe?
Please note that we are not looking for preemption/eviction of running pods (the preempt action is disabled in the configmap as you'll see below).

What you expected to happen:
Jobs that have a priorityClassName with a higher value should be scheduled ahead of all other jobs in the same queue and namespace, regardless of each job's resource requests.
How to reproduce it (as minimally and precisely as possible):
1. Create test-job1, which takes up the entire cluster, with priorityClassName: reg-priority. The job will start running. (The replicas of the task match the job-level minAvailable.)
2. Create test-job2 with a priorityClassName that has a higher value than reg-priority. The replicas of the task match the minAvailable.
3. Now, test-job2 is pending since test-job1 is taking up the entire cluster.

Please note that I have verified this with multiple jobs to ensure that FIFO is not just kicking in. Consistently, the jobs with fewer resources are being scheduled ahead even if other jobs have a higher priority class.
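The priority classes themselves are ordinary Kubernetes PriorityClass objects. A sketch with illustrative values (only reg-priority is named in the issue, so the higher class's name and both values are placeholders):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: reg-priority
value: 1000                   # illustrative value
globalDefault: false
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority         # placeholder name for the higher class
value: 10000                  # illustrative value
globalDefault: false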
Anything else we need to know?:
The configmaps look like this:
and
queue config
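A queue of this kind is a Volcano Queue object; a minimal sketch with placeholder values (not the actual queue config used here):

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: default               # placeholder; the queue the jobs are submitted to
spec:
  weight: 1
  reclaimable: false          # illustrative; relevant when eviction/reclaim is not wanted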
Environment:
- Kubernetes version (kubectl version): 1.23
- Kernel (uname -a): 5.13.0-28-generic #31~20.04.1-Ubuntu SMP Wed Jan 19 14:08:10 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux