Support for JobSet Preemption #682
Comments
This seems like a useful feature, quick question: historically for customers doing TPU multislice training with JobSet we've recommended using Kueue to handle workload priorities and preemption (link). Is the idea here to support this natively in JobSet for customers/users who do not want to use Kueue for whatever reason (additional complexity, etc.)?

Seems like we'll also need to think about the interaction between JobSet support for default scheduler preemption, and Kueue priority classes and workload preemption. I'm not familiar with how Kueue implements its workload preemption under the hood, would these changes interfere with Kueue's current preemption implementation?

Also, I prefer option 2 as well, since the implementation will be much more straightforward and does not rely on the new placement policy API, the scope of which has been a point of contention within WG Batch (as far as I know we haven't yet achieved alignment with the Kueue folks on this).
Kueue doesn't monitor the status of already dispatched workloads. So if a slice of a multi-slice high priority job fails, there is no mechanism to preempt a lower priority job. What we are proposing here is traditional kube-scheduler preemption, so the semantics are compatible with Kueue.
What would you like to be added:
Preemption at the whole JobSet level.
The user would like to run a training workload using JobSet on one or more accelerator islands (e.g., TPU slices). To do this, the user creates a JobSet with a replicatedJob of one or more replicas, and uses exclusive placement to ensure that each child Job lands on its own accelerator island.
Consider the case where the user would like to run multiple training workloads like the one described above but with different priorities, and would like to ensure that a high-priority workload preempts the low-priority ones when there is not enough capacity to run them all.
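For illustration, such a workload might look like the sketch below. The names, image, resource type, topology key, and PriorityClass are placeholders; the exclusive-placement annotation key reflects JobSet's exclusive placement feature and may differ by version.

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: high-pri-training           # placeholder name
  annotations:
    # Exclusive placement: one child Job per topology domain (e.g. per TPU slice / node pool).
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  replicatedJobs:
  - name: workers
    replicas: 2                     # one replica per accelerator island
    template:
      spec:
        parallelism: 4
        completions: 4
        backoffLimit: 0
        template:
          spec:
            priorityClassName: high-priority    # placeholder PriorityClass created by the user
            restartPolicy: Never
            containers:
            - name: trainer
              image: example.com/trainer:latest # placeholder image
              resources:
                limits:
                  google.com/tpu: 4             # placeholder accelerator resource
```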
Currently this doesn't work because of the anti-affinity rules that implement exclusive placement:
Exclusivity is currently enforced against any pod created by a Job that doesn't belong to the same JobSet; specifically, JobSet injects an anti-affinity constraint along the following lines.
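A sketch of that injected term, assuming the jobset.sigs.k8s.io/jobset-name label that JobSet puts on its pods, the built-in job-name label, and a node-pool topology key; the exact keys and topology key used by the webhook may differ.

```yaml
podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - topologyKey: cloud.google.com/gke-nodepool   # the configured exclusive topology domain
    namespaceSelector: {}
    labelSelector:
      matchExpressions:
      - key: job-name                            # any pod created by a Job...
        operator: Exists
      - key: jobset.sigs.k8s.io/jobset-name      # ...that does not belong to this JobSet
        operator: NotIn
        values:
        - my-jobset                              # this JobSet's name
```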
The solution to the above problem is to limit exclusivity to the same priority level, and let pod preemption address race conditions if two jobs from different priority levels race to the same slice.
If exclusivity is limited to the same priority level, then in the example above the leader pod of the higher-priority workload can preempt any pod of the lower-priority workload. Once it does, the worker pods of the higher-priority workload are created, assigned to the same slice, and preempt the remaining lower-priority workers (those workers may no longer exist at that point if the lower-priority workload is already restarting because of the initial preemption triggered by the higher-priority leader pod).
To do this, we need to allow injecting an additional selector on the priority into the anti-affinity term that JobSet injects automatically.
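A sketch of what the extended anti-affinity term could look like, using the jobset.sigs.k8s.io/priority label discussed below; the label value shown is a placeholder for the workload's priority level.

```yaml
podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - topologyKey: cloud.google.com/gke-nodepool
    namespaceSelector: {}
    labelSelector:
      matchExpressions:
      - key: job-name
        operator: Exists
      - key: jobset.sigs.k8s.io/jobset-name
        operator: NotIn
        values:
        - my-jobset
      - key: jobset.sigs.k8s.io/priority   # new: only repel pods at the same priority level
        operator: In
        values:
        - "1000"                           # placeholder: this workload's priority value
```

With the extra selector, pods from a different priority level no longer match the term, so the higher-priority leader pod can be scheduled onto an occupied slice and rely on standard kube-scheduler priority preemption to evict the lower-priority pods.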
There are two approaches to do this:
Option 1: Update the exclusivity API to allow tweaking the anti-affinity rules as discussed in #75, and then have the user explicitly set the priority label on the Jobs and tweak the anti-affinity rule as described above.
Option 2: JobSet does all of that automatically and we make it part of the API; basically JobSet injects the priority label
jobset.sigs.k8s.io/priority
or, for backward compatibility:
I prefer option 2.
Why is this needed:
Better utilization of infrastructure and faster restarts of training workloads: low-priority workloads can use spare capacity whenever the high-priority ones don't need it, and the capacity is quickly returned to the high-priority workloads when they do.
This enhancement requires the following artifacts:
The artifacts should be linked in subsequent comments.