Deprecate MPIJob v1 #1906
Comments
@kubeflow/wg-training-leads |
Are you suggesting moving the entire codebase to training-operator? Or use mpi-operator as a library? |
Use mpi-operator as a library. I think a separate binary for mpi-operator would be worth it since mpi-operator doesn't focus on ML Training. |
Sounds good |
Can you expand on this? This would be helpful for estimating work and allocating sufficient resources. |
Sure. Actually, there are already issues: Headless SVC issue: #1030 |
Friendly ping @johnugeorge :) |
Sorry for the late reply. Agreed. I am good with deprecating v1 in favor of v2. We need to take it up sometime. Can you explain more about your idea of creating a library? Do you mean that the reconcile logic from the MPI operator repo would be used within training-operator? Is it easy to manage manifests, etc.? We will target all prerequisites (#1030, #1703) for the next training operator 1.8 release, followed by MPI v2 support in training operator if we have time. What do you think? |
I have discussed this with @johnugeorge offline. We keep the standalone mpi-operator, and the training-operator uses mpi-operator as a library. This means that users can deploy MPIJob v2 as part of either the training operator or the mpi-operator. We have the following tasks to realize this migration and deprecation: Training Operator Side:
MPI Operator Side:
|
@tenzen-y Thanks! This approach looks good. |
Sounds great! I assume that would also fix #1807, and maybe some other MPIJob tickets: https://github.com/kubeflow/training-operator/issues?q=is%3Aissue+is%3Aopen+mpijob. More important, though, is whether there will be regressions compared to the current v1 features. Would training-operator MPIJob tests be updated to v2:
And/or mpi-operator tests brought to training-operator?
|
Yes, that's right.
Yes, we should have proper tests.
No, I think that we wouldn't have tests for MPI-Operator library in this repo. However, I think we should implement unit and e2e tests alongside the training-operator. |
+1 |
I think that we can use a kustomize remote ref, as follows:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- kubeflow.org_tfjobs.yaml
- kubeflow.org_mxjobs.yaml
- kubeflow.org_pytorchjobs.yaml
- kubeflow.org_xgboostjobs.yaml
- github.com/kubeflow/mpi-operator/manifesist/crd.yaml?ref=v0.4.0
- kubeflow.org_paddlejobs.yaml
Then, I think that we can ship pre-built all-in-one manifests in this repository for users without internet access.
If users want to install both operators, they need to disable MPIJob on the training-operator side, as in the past. |
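For readers wondering what "disable MPIJob on the training-operator side" could look like, here is a minimal sketch as a kustomize overlay. It is not taken from this thread: it assumes the training-operator binary accepts a repeatable `--enable-scheme` flag, that the Deployment and its first container are the training-operator ones, and that the `standalone` overlay at `ref=v1.7.0` is the base you want. Check each of these against the release you actually deploy.

```yaml
# Hypothetical overlay: run training-operator without the MPIJob controller.
# Assumptions (verify against your release): the Deployment is named
# "training-operator", container 0 is the manager, and the binary
# supports the repeatable --enable-scheme flag.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Remote base; the overlay path and ref are assumptions for illustration.
  - github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0
patches:
  - target:
      kind: Deployment
      name: training-operator
    patch: |-
      # Sets the container args to every scheme except mpijob.
      # Note: this overwrites any args already defined on the container.
      - op: add
        path: /spec/template/spec/containers/0/args
        value:
          - --enable-scheme=tfjob
          - --enable-scheme=pytorchjob
          - --enable-scheme=mxjob
          - --enable-scheme=xgboostjob
          - --enable-scheme=paddlejob
```

This only stops the training-operator from reconciling MPIJob resources. The MPIJob CRD bundled with the training-operator manifests would presumably still need to be reconciled with the one shipped by mpi-operator, which this sketch does not address.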
@tenzen-y Does it mean that we are going to maintain separate releases for MPI Operator and Training Operator ? |
Yes, that's right. |
@tenzen-y Can you explain why mpi-operator doesn't focus on ML Training? |
MPIJob isn't used only for machine learning. It is also used in generic HPC use cases like simulations. Any thoughts? @terrytangyuan @alculquicondor |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
/remove-lifecycle frozen |
/remove-lifecycle stale |
/retitle Deprecate MPIJob v1 |
+1 |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
We probably don't want to close this. |
Ideally, we should migrate the v2 implementation to the training operator, then remove the v1 implementation from the training-operator to reduce maintenance costs. However, we cannot do that immediately because there are many issues in the training operator (e.g., inconsistent job conditions, not using a headless svc, and so on). So, I think it would be better to mark the v1 implementation as deprecated, then stop adding new features to it and provide only bug fixes. We would then suggest that users who want the new features use the mpi-operator.
Originally posted by @tenzen-y in #1768 (comment)