Introduce batch/v1 Job with Indexed completion mode #1718
Interesting. This would need a major rewrite?
Probably, yes. Replacing Pod with a batch/v1 Indexed Job, we can reuse the Job's Pod-to-Pod communication logic. Also, we could probably remove most of the existing restart handling; this means we would no longer need the logic that judges whether to restart pods.
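A minimal sketch of the idea, assuming the restart handling moves to the Job controller: with an Indexed Job, `backoffLimit` and a Pod failure policy decide whether failed pods are recreated, so the operator would no longer need its own restart logic. The field values below are illustrative only, not actual training-operator code.

```go
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	"k8s.io/utils/pointer"
)

func main() {
	indexed := batchv1.IndexedCompletion
	// Only the restart-related fields are shown; a real Job also needs
	// spec.template with restartPolicy: Never.
	spec := batchv1.JobSpec{
		CompletionMode: &indexed,
		Completions:    pointer.Int32(4), // one completion index per rank
		Parallelism:    pointer.Int32(4),
		BackoffLimit:   pointer.Int32(3), // Job-level retry budget
		PodFailurePolicy: &batchv1.PodFailurePolicy{
			Rules: []batchv1.PodFailurePolicyRule{{
				// Fail the whole Job on a non-retriable exit code instead of restarting.
				Action: batchv1.PodFailurePolicyActionFailJob,
				OnExitCodes: &batchv1.PodFailurePolicyOnExitCodesRequirement{
					Operator: batchv1.PodFailurePolicyOnExitCodesOpIn,
					Values:   []int32{1},
				},
			}},
		},
	}
	fmt.Println("completions:", *spec.Completions, "backoffLimit:", *spec.BackoffLimit)
}
```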
/cc @alculquicondor
IIUC, the Indexed Job feature targets tf-operator, mpi-operator, and more.
Since its major benefit is reducing the duplicated development work for some features, do we have a list of features that are not implemented in training-operator but would be covered once batch/v1 Job is adopted?
Good point. I haven't created such a list yet. So far, I have found the suspend Job feature and pod disruption conditions. I'll create a list and share it with you in this issue.
+1 If you could provide a list of any missing functionality in the Job API, we could add it to the roadmap. Also, @ahg-g is working on a proposal for a multi-pod-template API, which he is going to present in the Batch Working Group meeting on Feb 2nd: https://docs.google.com/document/d/1XOeUN-K0aKmJJNq7H07r74n-mGgSFyiEDQ3ecwsGhec/edit#heading=h.ukbaidczvy3r
+1 The benefit is not just deduping code, but also helping to defragment the ecosystem. While I do understand the benefit of having dedicated APIs for MPI, TF training, etc., it is important that they build on a common API that we can use for job-level scheduling and autoscaling. As @alculquicondor mentioned, I am working on a proposal that I will make public next week; it will be discussed in the Kubernetes Batch Working Group. I am also happy to schedule a time to discuss it with the Kubeflow community; can you please let me know how/where I can put this topic on the meeting agenda?
@alculquicondor See training-operator/pkg/controller.v1/tensorflow/tfjob_controller.go, lines 505 to 534 (at ddf372c).
Maybe tensorflow-controller or training-controller can be one of the use cases for introducing a success policy to batch/v1 Job.
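As a hypothetical illustration (not the controller code linked above), a TFJob-style success policy could be evaluated against an Indexed Job's `status.completedIndexes`, for example declaring success once the rank-0 worker has finished. The helper below is an assumption for discussion, not an existing API.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"

	batchv1 "k8s.io/api/batch/v1"
)

// indexCompleted reports whether the given completion index appears in the
// Job's status.completedIndexes string, e.g. "0,2-4,7".
func indexCompleted(job *batchv1.Job, index int) bool {
	for _, part := range strings.Split(job.Status.CompletedIndexes, ",") {
		if part == "" {
			continue
		}
		bounds := strings.SplitN(part, "-", 2)
		lo, err := strconv.Atoi(bounds[0])
		if err != nil {
			continue
		}
		hi := lo
		if len(bounds) == 2 {
			if hi, err = strconv.Atoi(bounds[1]); err != nil {
				continue
			}
		}
		if index >= lo && index <= hi {
			return true
		}
	}
	return false
}

func main() {
	job := &batchv1.Job{}
	job.Status.CompletedIndexes = "0,2-3"
	// A "worker 0 finished" success policy would mark the training job
	// succeeded here, even if other indexes are still running.
	fmt.Println("rank 0 finished:", indexCompleted(job, 0))
}
```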
Thanks for sharing. I'm interested in a multi-pod-template API since we can consider using it after we introduce batch/v1 Job to training-operator. Are there KEPs for a multi-pod-template API in k/enhancements?
@ahg-g Yes, exactly. I think so too.
We have bi-weekly community meetings for WG Training, and the meeting notes are at https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit#. I rarely attend the meetings, but you can share a multi-pod-template API with the WG Training leads.
Not yet. As I mentioned above, I will share a Google doc next week; it is easier to discuss such a significant proposal in a Google doc first before we move to a KEP. Note that the plan is to initially host the API under the Kueue project to iterate fast on the API, with the goal of upstreaming it eventually.
I see. Thanks for letting me know.
I will work on this issue after the Kubeflow v1.7 feature freeze, since that date is coming up. Then, I will share a table mapping batch/v1 Job features to training-operator Job features in this issue. If this issue requires significant API changes, I will submit a proposal to this repository. Also, I will work on the actual implementation after #1714 is done.
/cc @richardsliu
Maybe, we need to wait for the
Is elastic PyTorch the only training job that supports resizing? Does it matter which workers get removed?
IIUC, we support only one master replica for PyTorchJob, so yes. See training-operator/pkg/apis/kubeflow.org/v1/pytorch_validation.go, lines 62 to 66 (at b87c6fa).
Maybe it does not matter which worker is deleted, since the Elastic PyTorchJob uses a local elastic agent. @gaocegege @zw0610 If I misunderstand the Elastic PyTorchJob, can you correct me?
Also, we may use
The Elastic Indexed Job feature is supposed to graduate to beta in K8s 1.27, so we can work on this once we stop supporting K8s 1.26 (maybe next year?).
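A minimal sketch, assuming the Elastic Indexed Job feature (mutable `completions`/`parallelism` for Indexed Jobs): resizing a worker Job would just mean updating both fields together. The Job name and namespace below are hypothetical.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/utils/pointer"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	ctx := context.Background()
	jobs := client.BatchV1().Jobs("default")

	job, err := jobs.Get(ctx, "pytorchjob-sample-worker", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Elastic Indexed Jobs require completions and parallelism to be
	// changed to the same value in a single update.
	newReplicas := int32(8)
	job.Spec.Completions = pointer.Int32(newReplicas)
	job.Spec.Parallelism = pointer.Int32(newReplicas)

	if _, err := jobs.Update(ctx, job, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("scaled worker Job to", newReplicas, "replicas")
}
```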
I agree.
We may be able to introduce JobSet instead of batch/v1 Job, although I think we need to wait for the JobSet API to reach beta.
As a first step, migrating to batch/v1 Job might be better. After that, we can migrate to JobSet.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/lifecycle frozen
/assign
/kind discussion
We often implement features similar to batch/v1 Job (e.g., kubeflow/common#196) since the training operator creates a block of Pod + Service for each rank, rather than batch/v1 Job + Service, once a custom Job resource (e.g., TFJob) is created.
IIUC, training-operator was designed this way because its core architecture was created before the Indexed Job and Pod failure policy features were released.
So I would like to propose an architecture in which the training-operator creates a batch/v1 Job with Indexed completion mode + Service, rather than Pod + Service.
Introducing batch/v1 Job eliminates the need to implement and maintain features that duplicate batch/v1 Job, and makes it easy to adopt new batch/v1 Job features.
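A rough sketch of the proposed object shape, assuming one Indexed Job plus a headless Service per replica type (all names are hypothetical, not actual training-operator code):

```go
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/pointer"
)

func workerObjects(jobName string, replicas int32, podTemplate corev1.PodTemplateSpec) (*batchv1.Job, *corev1.Service) {
	indexed := batchv1.IndexedCompletion
	name := jobName + "-worker"

	// Stable per-rank hostnames come from the Indexed Job plus a headless
	// Service selecting its Pods.
	podTemplate.Spec.Subdomain = name
	podTemplate.Spec.RestartPolicy = corev1.RestartPolicyNever

	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: batchv1.JobSpec{
			CompletionMode: &indexed,
			Completions:    pointer.Int32(replicas),
			Parallelism:    pointer.Int32(replicas),
			Template:       podTemplate,
		},
	}

	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: corev1.ServiceSpec{
			ClusterIP: corev1.ClusterIPNone, // headless, for Pod-to-Pod communication
			// Assumes the "job-name" label the Job controller puts on its Pods.
			Selector: map[string]string{"job-name": name},
		},
	}
	return job, svc
}

func main() {
	job, svc := workerObjects("tfjob-sample", 4, corev1.PodTemplateSpec{})
	fmt.Println(job.Name, *job.Spec.Completions, svc.Name)
}
```

With the headless Service selecting the Job's Pods, each rank could keep a stable DNS name of the form `<job-name>-<index>.<service>`, which could replace the per-rank Service objects the operator creates today.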
/cc @kubeflow/wg-training-leads