Currently the exclusive topology annotation is set at the JobSet level. However, there is a use case for making it configurable at the replicatedJob level: running TensorFlow workloads on TPUs.
A TensorFlow worker job running on TPUs should run exclusively on a single node pool (TPU pod slice). However, we don't need or want the coordinator job (which would be modeled as a separate replicatedJob) to run exclusively on an entire node pool and block everything else from running on it. This could be solved if exclusive topology were configurable at the replicatedJob level.
danielvegamyhre changed the title from "Exclusive topology at the replicatedJob level" to "Possible use case for exclusive topology at the replicatedJob level" on Aug 29, 2023
Exclusive topology at the JobSet level means we want to assign each replicated job exclusively to one topology domain, so the podAntiAffinity LabelSelector matches all other replicated jobs.
Does exclusive topology at the replicatedJob level mean that a replicated job is exclusive only with respect to other replicated jobs that also have the exclusive topology annotation? In that case, its podAntiAffinity LabelSelector would match only the replicated jobs that have the exclusive topology annotation.
The coordinator won't have podAffinity or podAntiAffinity.
worker-a will have a podAffinity that matches all pods of worker-a and a podAntiAffinity that matches all pods of worker-b; it ignores the coordinator.
worker-b will have a podAffinity that matches all pods of worker-b and a podAntiAffinity that matches all pods of worker-a; it ignores the coordinator.
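To make the coordinator/worker-a/worker-b example concrete, here is a rough sketch in Python of how the affinities could be derived per replicatedJob. This is not JobSet's actual implementation: the label key `example.com/replicated-job`, the helper `build_affinity`, and the use of `cloud.google.com/gke-nodepool` as the topology key are all assumptions for illustration. Only the shape of the affinity terms follows the Kubernetes pod spec.

```python
# Hypothetical label key identifying which replicatedJob a pod belongs to.
# This is an assumption for illustration, not a label JobSet is known to set.
REPLICATED_JOB_LABEL = "example.com/replicated-job"


def _term(values, topology_key):
    """Build one affinity term selecting pods of the given replicatedJobs."""
    return {
        "labelSelector": {
            "matchExpressions": [
                {"key": REPLICATED_JOB_LABEL, "operator": "In", "values": values}
            ]
        },
        "topologyKey": topology_key,
    }


def build_affinity(name, exclusive_jobs, topology_key):
    """Return the affinity dict for replicatedJob `name`.

    Returns None when `name` is not marked exclusive (e.g. the coordinator),
    so its pods get no podAffinity or podAntiAffinity at all.
    """
    if name not in exclusive_jobs:
        return None
    required = "requiredDuringSchedulingIgnoredDuringExecution"
    # Co-locate with pods of the same replicatedJob.
    affinity = {"podAffinity": {required: [_term([name], topology_key)]}}
    # Repel only the *other exclusive* replicatedJobs; non-exclusive ones
    # (such as the coordinator) are ignored.
    others = sorted(j for j in exclusive_jobs if j != name)
    if others:
        affinity["podAntiAffinity"] = {required: [_term(others, topology_key)]}
    return affinity


if __name__ == "__main__":
    exclusive = {"worker-a", "worker-b"}
    # Assumed topology key for this example: one domain per GKE node pool.
    key = "cloud.google.com/gke-nodepool"
    for job in ("coordinator", "worker-a", "worker-b"):
        print(job, build_affinity(job, exclusive, key))
```

With this shape, the coordinator can still land on a node pool occupied by a worker, since nothing in the workers' anti-affinity selects it.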