Simplifying inter-pod-communication between Job Pods #99497
@alculquicondor: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
+1 to having a mode for stable DNS pod names. There are two cases for stable DNS names.
While MPI can likely use StatefulSet, tf-operator could be significantly simplified if v1.Job supported indexed completion and stable DNS names. A TFJob includes at least two sets of jobs: workers and parameter servers (PS). Each worker and PS has a unique index (tf-operator currently implements this by creating a Service per replica). The problem with Services is twofold.
An Indexed Job with stable DNS names would bring several advantages to TF distributed training.
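To make the simplification concrete, here is a sketch (hypothetical function and names, not tf-operator code) of how a replica could derive a TF_CONFIG-style cluster spec purely from stable, index-based DNS names of the form <job-name>-<index>.<subdomain>, with no per-replica Service:

```python
import json


def build_tf_config(job_name, subdomain, num_workers, num_ps,
                    index, task_type, port=2222):
    """Build a TF_CONFIG-style cluster spec from stable, index-based
    DNS names. Illustrative only; names and layout are assumptions."""
    def addr(prefix, i):
        # Each replica is reachable at <prefix>-<i>.<subdomain>:<port>.
        return f"{prefix}-{i}.{subdomain}:{port}"

    cluster = {
        "worker": [addr(f"{job_name}-worker", i) for i in range(num_workers)],
        "ps": [addr(f"{job_name}-ps", i) for i in range(num_ps)],
    }
    return json.dumps({"cluster": cluster,
                       "task": {"type": task_type, "index": index}})
```

Because every replica can compute the same spec locally from its own index, the operator no longer needs to create and garbage-collect one Service per replica.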
cc @soltysh
Proposal (brainstormed with @ahg-g): set a deterministic hostname and a shared subdomain on the Job's Pods, so that a user-created headless Service can expose each Pod under a stable DNS name.
These changes are backwards compatible with the existing Indexed Job, so they can be added to that completion mode as part of the beta graduation. I'm bringing this up for the next SIG meeting. @johnbelamaric, do you have any thoughts from a DNS perspective?
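For illustration, a setup along the lines of this proposal could look roughly like the following manifest sketch. All names here are hypothetical and the exact fields depend on the final design; the headless Service gives Pods DNS records of the form <hostname>.<subdomain>:

```yaml
# Illustrative sketch only: a headless Service plus an Indexed Job whose
# Pods would get stable DNS names such as training-0.workers.
apiVersion: v1
kind: Service
metadata:
  name: workers            # hypothetical name; must match the Pods' subdomain
spec:
  clusterIP: None          # headless: DNS records only, no virtual IP
  selector:
    job-name: training     # matches the label the Job controller sets on its Pods
---
apiVersion: batch/v1
kind: Job
metadata:
  name: training
spec:
  completionMode: Indexed
  completions: 4
  parallelism: 4
  template:
    spec:
      subdomain: workers   # must match the headless Service name above
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo $JOB_COMPLETION_INDEX"]
```

Under this sketch, each Pod's hostname would follow the $(job-name)-$(index) pattern, so peers can address index 0 as training-0.workers without any per-Pod Service objects.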
That should work. My only concerns would be 1) the scale and 2) the DNS programming latency. For very large jobs with short-lived pods, this would create a fair bit of churn and API server traffic on the EndpointSlice watches to CoreDNS (note that when using a headless service there are no iptables rules programmed, IIRC, so that shouldn't be an issue). If the pods want to inter-communicate, they must also be prepared to retry the lookup in the event that DNS has yet to be programmed when they come up (particularly if you are bringing up thousands of pods at a time). Another option would be to have some new functionality in CoreDNS (or another DNS provider) that watches Jobs and their associated Pods and serves up DNS for those directly.
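The retry caveat above can be sketched as a small helper (illustrative only, not code from this thread). The resolver is injectable so the backoff logic can be exercised without touching real DNS:

```python
import socket
import time


def resolve_with_retry(hostname, attempts=5, delay=1.0,
                       resolver=socket.getaddrinfo):
    """Retry a DNS lookup, since records for newly created pods may not
    be programmed yet when a peer first tries to resolve them."""
    last_err = None
    for attempt in range(attempts):
        try:
            return resolver(hostname, None)
        except OSError as err:  # socket.gaierror is an OSError subclass
            last_err = err
            time.sleep(delay * (2 ** attempt))  # exponential backoff
    raise last_err
```

A worker would call something like this before dialing a peer such as training-0.workers, instead of assuming the record exists on first try.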
Thanks John, we could start with relying on a headless service, and then explore the option of having a watch on the Jobs in the DNS service. I think the implementation from the job controller's perspective will not change: setting the hostname and subdomain on the Job's pods.
Thanks @johnbelamaric! What you mention is in line with what we thought. On the flip side, the network programming is opt-in, as the user has to create the Service and match the Pods' subdomain. We also expect that tightly coupled parallel Jobs wouldn't be very short-lived. We will call out lookup retries in the KEP and documentation. The only solution that wouldn't require DNS lookups is one that queries the apiserver to obtain the Pod IPs (which users could do using the indexes, if they wish to do so). Having direct support in CoreDNS sounds promising. Glad to know you are open to the idea. We can certainly explore that later.
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
We solved this with Indexed Jobs: kubernetes/enhancements#2630
/close
@alculquicondor: Closing this issue.
In #97169 (kubernetes/enhancements#2214) we are adding an Indexed completion mode, in which each Pod of a Job gets an index as an annotation. This is useful for static work partitioning in parallel jobs.
We see a potential expansion of the Indexed completion mode (also mentioned in the KEP alternatives) in which it becomes easier to reference each Pod from outside or inside the Job. This can be accomplished in two ways.
I'm leaving this issue open for discussion and feedback from sig-apps members and ecosystem developers of HPC frameworks on Kubernetes.
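As a concrete illustration of the static work partitioning mentioned above (a hypothetical helper, not code from the KEP): each Pod can use its index to pick a disjoint shard of the input, so no coordination between Pods is needed.

```python
def shard_for_index(items, index, completions):
    """Return the disjoint slice of work assigned to one Pod, given its
    completion index and the Job's total completion count. Illustrative."""
    if not 0 <= index < completions:
        raise ValueError("index out of range")
    # Strided slicing gives every Pod a disjoint subset covering all items.
    return items[index::completions]
```

Inside the Pod, the index would come from the annotation the controller sets (e.g. exposed to the container via the downward API or an environment variable such as JOB_COMPLETION_INDEX).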
/sig apps
/area batch