Possible use case for exclusive topology at the replicatedJob level #280

Closed
danielvegamyhre opened this issue Aug 29, 2023 · 2 comments

Comments

@danielvegamyhre
Contributor

Currently, the exclusive topology annotation is set at the JobSet level. However, there is a use case for making it configurable at the replicatedJob level: running TensorFlow workloads on TPUs.

A TensorFlow worker job running on TPUs should run exclusively on a single node pool (TPU pod slice). However, we don't need or want the coordinator job (which would be modeled as a separate replicatedJob) to run exclusively on an entire node pool and block everything else from running on it. This could be solved if exclusive topology were configurable at the replicatedJob level.

@danielvegamyhre changed the title from "Exclusive topology at the replicatedJob level" to "Possible use case for exclusive topology at the replicatedJob level" on Aug 29, 2023
@homily707

Exclusive topology at the JobSet level means we want to assign each replicated job exclusively to one domain, so the podAntiAffinity LabelSelector matches all other replicated jobs.

Would exclusive topology at the replicatedJob level mean that a replicated job is exclusive only with respect to other replicated jobs that also have the exclusive topology annotation? In that case, its podAntiAffinity LabelSelector would match only the replicated jobs that have the exclusive topology annotation.

example:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: pytorch
spec:
  replicatedJobs:
    - name: coordinator
      ...
    - name: worker-a
      ...
          annotations:
            alpha.jobset.sigs.k8s.io/exclusive-topology: rack
    - name: worker-b
      ...
          annotations:
            alpha.jobset.sigs.k8s.io/exclusive-topology: rack

The coordinator won't have any podAffinity or podAntiAffinity.
worker-a will have a podAffinity that matches all pods of worker-a and a podAntiAffinity that matches all pods of worker-b; it ignores the coordinator.
worker-b will have a podAffinity that matches all pods of worker-b and a podAntiAffinity that matches all pods of worker-a; it ignores the coordinator.
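
To make the selection rule above concrete, here is a rough Python sketch of how the affinity terms might be derived per replicatedJob. It is only an illustration of the idea, not how the JobSet controller is implemented: the `build_affinity` helper is hypothetical, and the pod label key `jobset.sigs.k8s.io/replicatedjob-name` is assumed to identify the owning replicatedJob. The sketch also assumes every opted-in replicatedJob uses the same topology key.

```python
from typing import Dict, List, Optional

# Annotation key taken from the example above.
EXCLUSIVE_KEY = "alpha.jobset.sigs.k8s.io/exclusive-topology"
# Assumption: pods carry this label naming their owning replicatedJob.
RJOB_LABEL = "jobset.sigs.k8s.io/replicatedjob-name"
REQUIRED = "requiredDuringSchedulingIgnoredDuringExecution"


def _term(topology_key: str, names: List[str]) -> dict:
    """Build one affinity term selecting pods of the given replicatedJobs."""
    return {
        "topologyKey": topology_key,
        "labelSelector": {
            "matchExpressions": [
                {"key": RJOB_LABEL, "operator": "In", "values": names}
            ]
        },
    }


def build_affinity(name: str, annotations: Dict[str, Dict[str, str]]) -> Optional[dict]:
    """Hypothetical helper: affinity for replicatedJob `name`, or None if it did not opt in."""
    topology_key = annotations[name].get(EXCLUSIVE_KEY)
    if topology_key is None:
        # Not annotated (e.g. the coordinator): no affinity constraints at all.
        return None
    # Only other replicatedJobs that also carry the annotation are repelled.
    others = sorted(
        other for other, ann in annotations.items()
        if other != name and EXCLUSIVE_KEY in ann
    )
    affinity = {"podAffinity": {REQUIRED: [_term(topology_key, [name])]}}
    if others:
        affinity["podAntiAffinity"] = {REQUIRED: [_term(topology_key, others)]}
    return affinity


# Annotations per replicatedJob, mirroring the example JobSet.
jobs = {
    "coordinator": {},
    "worker-a": {EXCLUSIVE_KEY: "rack"},
    "worker-b": {EXCLUSIVE_KEY: "rack"},
}
```

With these inputs, `build_affinity("coordinator", jobs)` returns `None`, while `build_affinity("worker-a", jobs)` attracts worker-a pods and repels only worker-b pods, so the coordinator is free to land in any domain.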

@danielvegamyhre
Contributor Author

danielvegamyhre commented Jan 23, 2024

Closing since this will be investigated as part of the larger umbrella issue for defining a Placement Policy API: #75
