Support kubeflow's MPIJob #65
/kind feature
Just out of curiosity, how many ranks are you going to test? Are you also looking into different distribution frameworks besides Horovod?
Any suggestions? I think our first step is to choose a small number of ranks (for example, 3) to ensure that the whole process is feasible.
I think Kueue is agnostic about the framework. Whether it is Horovod or PyTorch, as long as it can be launched by MPI through mpi-operator, it is fine.
@ArangoGutierrez Help me out: what was the critical number of ranks (i.e., nodes) at which we have seen bad scaling for the MPI operator?
Sorry, we are not planning to implement an MPIJob. We are planning to support queuing for the existing kubeflow mpi-operator: https://github.com/kubeflow/mpi-operator/tree/master/v2. I think your questions fit better in that repository.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/lifecycle frozen
Work on mpi-operator will start soon: kubeflow/mpi-operator#504. The Kueue side of things is currently blocked on #369.
/assign |
Starting the work under #578. IIUC, the work isn't strictly blocked on #369. Moreover, a working initial implementation of the MPI integration may help us better abstract out the interfaces, which could happen as a follow-up. Adding the interfaces prior to the MPI integration also makes sense, but then we may need to adapt them later; that may happen with future framework integrations anyway. So perhaps there is no point in blocking one on the other; let's just try to align in the process.
I'm ok with either using the job interface or not, since releasing v0.3 isn't blocked by this feature.
Either works; I just hope to avoid repetitive work.
That is kubeflow's mpi-operator. We could have started with other custom jobs, but this one seems important enough for our audience.
MPIJobs currently don't have a suspend field, so we need to add one. Then we program the controller based on the existing Kueue job-controller.
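To illustrate the idea, here is a minimal Go sketch of what a suspend field and the controller-side check could look like. The type names (`RunPolicy`, `MPIJobSpec`, `isSuspended`) are hypothetical simplifications, not the actual mpi-operator API; the field mirrors the semantics of `.spec.suspend` on the built-in batch/v1 Job, where a nil or false value means the job may run.

```go
package main

import "fmt"

// RunPolicy is a hypothetical, simplified stand-in for an MPIJob run
// policy; the real API lives in the kubeflow/mpi-operator repository.
type RunPolicy struct {
	// Suspend, when true, tells the controller not to create (or to
	// delete) the job's pods, mirroring batch/v1 Job's .spec.suspend.
	Suspend *bool
}

// MPIJobSpec is a hypothetical, pared-down job spec for illustration.
type MPIJobSpec struct {
	RunPolicy RunPolicy
}

// isSuspended follows the usual Kubernetes convention: a nil pointer
// means "not suspended", so only an explicit true suspends the job.
func isSuspended(spec MPIJobSpec) bool {
	return spec.RunPolicy.Suspend != nil && *spec.RunPolicy.Suspend
}

func main() {
	suspended := true
	job := MPIJobSpec{RunPolicy: RunPolicy{Suspend: &suspended}}
	fmt.Println(isSuspended(job)) // prints "true"
	fmt.Println(isSuspended(MPIJobSpec{})) // prints "false"
}
```

A queue controller built on this pattern would admit a workload by flipping `Suspend` to false, which is exactly how Kueue's existing job-controller gates batch/v1 Jobs.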
/label feature
/size L
/priority important-longterm