Support for JobSet Preemption #682
Comments
This seems like a useful feature, quick question: historically for customers doing TPU multislice training with JobSet we've recommended using Kueue to handle workload priorities and preemption (link). Is the idea here to support this natively in JobSet for customers/users who do not want to use Kueue for whatever reason (additional complexity, etc.)?

Seems like we'll also need to think about the interaction between JobSet support for default scheduler preemption, and Kueue priority classes and workload preemption. I'm not familiar with how Kueue implements its workload preemption under the hood, would these changes interfere with Kueue's current preemption implementation?

Also, I prefer option 2 as well, since the implementation will be much more straightforward and does not rely on the new placement policy API, the scope of which has been a point of contention within WG Batch (as far as I know we haven't yet achieved alignment with the Kueue folks on this).
Kueue doesn't monitor the status of already dispatched workloads. So if a slice of a multi-slice high priority job fails, there is no mechanism to preempt a lower priority job. What we are proposing here is traditional kube-scheduler preemption, so the semantics are compatible with Kueue.
What would you like to be added:
Preemption at the whole JobSet level.
The user would like to run a training workload using JobSet on one or more accelerator islands (e.g., TPU slices). To do this, the user creates a JobSet with a replicatedJob of one or more replicas, and uses exclusive placement to ensure that each child Job lands on its own accelerator island.
Consider the case where the user would like to run multiple training workloads like the one described above but with different priorities, and would like to ensure that a high-priority workload preempts the low-priority ones when there is not enough capacity to run them all.
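For illustration, such a workload might look like the sketch below. The names, image, resource type, topology key, and PriorityClass are placeholders; the exclusive-placement annotation key reflects JobSet's exclusive placement feature and may differ by version.

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: high-pri-training           # placeholder name
  annotations:
    # Exclusive placement: one child Job per topology domain (e.g. per TPU slice / node pool).
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  replicatedJobs:
  - name: workers
    replicas: 2                     # one replica per accelerator island
    template:
      spec:
        parallelism: 4
        completions: 4
        backoffLimit: 0
        template:
          spec:
            priorityClassName: high-priority    # placeholder PriorityClass created by the user
            restartPolicy: Never
            containers:
            - name: trainer
              image: example.com/trainer:latest # placeholder image
              resources:
                limits:
                  google.com/tpu: 4             # placeholder accelerator resource
```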
Currently this doesn't work because of the anti-affinity rules that implement exclusive placement:
Exclusivity is currently enforced against any pod created by a Job that doesn't belong to the same JobSet; specifically, JobSet injects an anti-affinity constraint along the following lines.
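A sketch of that injected term, assuming the jobset.sigs.k8s.io/jobset-name label that JobSet puts on its pods, the built-in job-name label, and a node-pool topology key; the exact keys and topology key used by the webhook may differ.

```yaml
podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - topologyKey: cloud.google.com/gke-nodepool   # the configured exclusive topology domain
    namespaceSelector: {}
    labelSelector:
      matchExpressions:
      - key: job-name                            # any pod created by a Job...
        operator: Exists
      - key: jobset.sigs.k8s.io/jobset-name      # ...that does not belong to this JobSet
        operator: NotIn
        values:
        - my-jobset                              # this JobSet's name
```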
The solution to the above problem is to limit exclusivity to the same priority level, and let pod preemption address race conditions if two jobs from different priority levels race to the same slice.
If exclusivity is limited to the same priority level, then in the example above the leader pod of the higher-priority workload can preempt any pod of the lower-priority workload. Once it does, the worker pods of the higher-priority workload are created, assigned to the same slice, and preempt the remaining lower-priority workers (those workers may no longer exist at that point if the lower-priority workload is already restarting because of the initial preemption triggered by the higher-priority leader pod).
To do this, we need to allow injecting an additional selector on the priority into the anti-affinity term that JobSet injects automatically.
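A sketch of what the extended anti-affinity term could look like, using the jobset.sigs.k8s.io/priority label discussed below; the label value shown is a placeholder for the workload's priority level.

```yaml
podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - topologyKey: cloud.google.com/gke-nodepool
    namespaceSelector: {}
    labelSelector:
      matchExpressions:
      - key: job-name
        operator: Exists
      - key: jobset.sigs.k8s.io/jobset-name
        operator: NotIn
        values:
        - my-jobset
      - key: jobset.sigs.k8s.io/priority   # new: only repel pods at the same priority level
        operator: In
        values:
        - "1000"                           # placeholder: this workload's priority value
```

With the extra selector, pods from a different priority level no longer match the term, so the higher-priority leader pod can be scheduled onto an occupied slice and rely on standard kube-scheduler priority preemption to evict the lower-priority pods.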
There are two approaches to do this:
Option 1: Update the exclusivity API to allow tweaking the anti-affinity rules as discussed in #75, and then have the user explicitly set the priority label on the Jobs and tweak the anti-affinity rule as described above.
Option 2: JobSet does all of that automatically and we make it part of the API; basically JobSet injects the priority label
jobset.sigs.k8s.io/priority
or, for backward compatibility:
I prefer option 2.
Why is this needed:
Better utilization of infrastructure and faster restarts of training workloads: low-priority workloads can use spare capacity whenever the high-priority ones don't need it, and the capacity is quickly returned to the high-priority workloads when they do.
This enhancement requires the following artifacts:
The artifacts should be linked in subsequent comments.