Possible use case for exclusive topology at the replicatedJob level #280

danielvegamyhre · 2023-08-29T17:25:38Z

Currently, the exclusive topology annotation is set at the JobSet level. However, there is a use case for making it configurable at the replicatedJob level: running TensorFlow workloads on TPUs.

A TensorFlow worker job running on TPUs should run exclusively on a single node pool (TPU pod slice). However, the coordinator job (which would be modeled as a separate replicatedJob) does not need to run exclusively on an entire node pool, and we don't want it to block everything else from running there. This could be solved if exclusive topology were configurable at the replicatedJob level.

homily707 · 2023-08-31T07:27:18Z

Exclusive topology at the JobSet level means we want to assign each replicated job exclusively to one domain, so the podAntiAffinity's LabelSelector matches all other replicated jobs.

Would exclusive topology at the replicatedJob level mean that a replicated job is exclusive only with respect to other replicated jobs that also have the exclusive topology annotation? In that case, its podAntiAffinity's LabelSelector would match only the replicated jobs that have the exclusive topology annotation.

example:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: pytorch
spec:
  replicatedJobs:
    - name: coordinator
      ...
    - name: worker-a
      ...
          annotations:
            alpha.jobset.sigs.k8s.io/exclusive-topology: rack
    - name: worker-b
      ...
          annotations:
            alpha.jobset.sigs.k8s.io/exclusive-topology: rack

The coordinator won't have any podAffinity or podAntiAffinity.
worker-a will have a podAffinity that matches all pods of worker-a and a podAntiAffinity that matches all pods of worker-b; it ignores the coordinator.
worker-b will have a podAffinity that matches all pods of worker-b and a podAntiAffinity that matches all pods of worker-a; it ignores the coordinator.
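
To make the proposed semantics concrete, here is a rough Python sketch of how the affinity terms could be derived from the per-replicatedJob annotation. It is only an illustration of the idea above, not how JobSet implements exclusive placement. The function names, the input shape, and the pod label key `jobset.sigs.k8s.io/replicatedjob-name` used in the selectors are assumptions made for this example; the output mimics the `podAffinity` / `podAntiAffinity` shape of a Pod spec.

```python
# Annotation key taken from the example above.
EXCLUSIVE_ANNOTATION = "alpha.jobset.sigs.k8s.io/exclusive-topology"
# Assumption: pods carry a label naming their replicatedJob; the key is illustrative.
REPLICATED_JOB_LABEL = "jobset.sigs.k8s.io/replicatedjob-name"
REQUIRED = "requiredDuringSchedulingIgnoredDuringExecution"


def _term(names, topology_key):
    """Build one required pod (anti-)affinity term in Pod spec shape."""
    return {
        "labelSelector": {
            "matchExpressions": [
                {"key": REPLICATED_JOB_LABEL, "operator": "In", "values": names}
            ]
        },
        "topologyKey": topology_key,
    }


def build_affinities(replicated_jobs):
    """Return {replicatedJob name: affinity dict, or None when not exclusive}."""
    # Only replicatedJobs carrying the annotation take part in exclusivity.
    exclusive = {
        rj["name"]: rj["annotations"][EXCLUSIVE_ANNOTATION]
        for rj in replicated_jobs
        if EXCLUSIVE_ANNOTATION in rj.get("annotations", {})
    }
    result = {}
    for rj in replicated_jobs:
        name = rj["name"]
        topology_key = exclusive.get(name)
        if topology_key is None:
            # e.g. the coordinator: no affinity rules, free to land anywhere.
            result[name] = None
            continue
        # Co-locate with own pods; repel only the other annotated replicatedJobs.
        others = sorted(n for n in exclusive if n != name)
        affinity = {"podAffinity": {REQUIRED: [_term([name], topology_key)]}}
        if others:
            affinity["podAntiAffinity"] = {REQUIRED: [_term(others, topology_key)]}
        result[name] = affinity
    return result


# The three replicatedJobs from the example above.
job_set = [
    {"name": "coordinator"},
    {"name": "worker-a", "annotations": {EXCLUSIVE_ANNOTATION: "rack"}},
    {"name": "worker-b", "annotations": {EXCLUSIVE_ANNOTATION: "rack"}},
]
affinities = build_affinities(job_set)
```

With this input, `affinities["coordinator"]` is `None`, while worker-a and worker-b each get a podAffinity on their own pods and a podAntiAffinity on the other worker, both with `topologyKey: rack`. Because the anti-affinity selector lists only annotated replicatedJobs, the coordinator's pods never appear in it.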

danielvegamyhre · 2024-01-23T00:10:22Z

Closing since this will be investigated as part of the larger umbrella issue for defining a Placement Policy API: #75

danielvegamyhre changed the title ~~Exclusive topology at the replicatedJob level~~ Possible use case for exclusive topology at the replicatedJob level Aug 29, 2023

danielvegamyhre closed this as completed Jan 23, 2024
