Support mpi jobs in universal operator #1345
/cc @alculquicondor
cc @zw0610
At the last community meeting, I explained that internally we prefer using APIs that are already production-ready. From my understanding, the v2 API, as well as its controller, is still being completed and has not been widely adopted by community users. Merging the v2 version directly may not benefit the many users who remain on the v1 API.
I estimate the v2 controller and API will be ready in a month at most. Then we can produce a release. The new controller has more coverage at the unit, integration, and e2e levels, so it's on a good track to be production-ready :)
Also, I'm interested in helping with the migration of mpi-operator to the universal operator, but only after the v2 release. I don't think I would be able to implement everything myself, but I can at the very least help with reviews.
How about getting the v2 API directly into the universal operator while we continue using v1 from the current MPI repo? This way, both will be isolated and switching from v1 to v2 will be easy.
I vote yes to that. We can migrate only the v2 API. But we are not ready yet.
@zw0610 proposed merging the two versions in this repo. Since MPI uses different API versions, it's okay for them to coexist. As far as I know, there are lots of v1 users, and they will not immediately migrate to v2 because of both operator and API changes.
One problem I see is that we don't have a conversion webhook. The v2 controller can't support v1 APIs, and even if it could, it cannot handle the missing resources that jobs created with a v1 controller wouldn't have (like the workers Service). I think it's reasonable to only have support for v2. If users don't want to upgrade yet, they can keep using the separate v1 controller.
I mean two APIs + two controllers. Just treat MPI v2 as a separate framework. In that case, we don't necessarily need conversions.
I don't see much value in adding the two controllers, considering the extra effort.
I think we should take this as an opportunity to encourage users to upgrade and simplify their infrastructure.
Is there any progress? Recently, some users have been asking for Horovod support in the universal operator.
/cc @qiankunli
Our team has been working on this integration; we hope to submit a draft PR next week.
Are the integration and E2E tests being migrated as well?
From @alculquicondor
Let's continue this conversation here. Is everyone okay with ONLY migrating the NEW v2 controller? I can see things will be hard to maintain if we migrate two versions. The old version will be deprecated at some point in the future anyway, so we will have to come up with a migration/deprecation plan and schedule. Is there any functionality we are missing in v2?
Regarding the first question:
No. I still see many users using the v1 API. Forcing users to adopt the v2 API does not seem like a good choice for our internal users and external customers. As I brought up in the community meeting before, it would be great to provide solid use cases for the v2 API at scale, which I can use as examples to show our users that the v2 controller is production-ready.
Meanwhile, I think one of the benefits of using the one-manager-multi-controller mode is that we can support multiple controllers in one repository. To me, mpi-controller v1 and mpi-controller v2 are just two controllers, which do not need to be maintained by the same group of contributors. I do not fully understand the exclusive decision to keep one instead of both.
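For readers less familiar with the one-manager-multi-controller setup mentioned above, here is a minimal sketch of how several framework controllers can share a single controller-runtime manager. The reconciler names and the resource types they watch below are placeholders chosen only to keep the example self-contained; the actual training-operator wiring and MPIJob types differ.

```go
package main

import (
	"context"
	"os"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

// mpiV1Reconciler stands in for the existing MPIJob v1 controller.
type mpiV1Reconciler struct{}

func (r *mpiV1Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// The v1 reconcile logic would live here.
	return ctrl.Result{}, nil
}

func (r *mpiV1Reconciler) SetupWithManager(mgr ctrl.Manager) error {
	// Placeholder watch target; a real controller would use For(&mpiv1.MPIJob{}).
	return ctrl.NewControllerManagedBy(mgr).Named("mpijob-v1").For(&corev1.ConfigMap{}).Complete(r)
}

// mpiV2Reconciler stands in for the new MPIJob v2 controller.
type mpiV2Reconciler struct{}

func (r *mpiV2Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// The v2 reconcile logic would live here.
	return ctrl.Result{}, nil
}

func (r *mpiV2Reconciler) SetupWithManager(mgr ctrl.Manager) error {
	// Placeholder watch target; a real controller would use For(&mpiv2.MPIJob{}).
	return ctrl.NewControllerManagedBy(mgr).Named("mpijob-v2").For(&corev1.Secret{}).Complete(r)
}

func main() {
	ctrl.SetLogger(zap.New())

	// One manager shares the client, cache, metrics endpoint, and leader
	// election across every controller registered with it.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		ctrl.Log.Error(err, "unable to create manager")
		os.Exit(1)
	}

	// Register each framework controller with the shared manager.
	for _, setup := range []func(ctrl.Manager) error{
		(&mpiV1Reconciler{}).SetupWithManager,
		(&mpiV2Reconciler{}).SetupWithManager,
	} {
		if err := setup(mgr); err != nil {
			ctrl.Log.Error(err, "unable to set up controller")
			os.Exit(1)
		}
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		ctrl.Log.Error(err, "manager exited with error")
		os.Exit(1)
	}
}
```

Adding one more controller in this mode is mostly a matter of another SetupWithManager call, which is why coexisting v1 and v2 MPI controllers in one binary is at least technically straightforward.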
Seems like we are facing a chicken-and-egg problem: we need adoption to show production readiness, but production readiness comes from a wider range of testing by users. @alculquicondor, you mentioned the maintenance effort for two controllers. I think at least @zw0610's team can help maintain the v1 controller. Any other concerns?
They can still use the v1 API. We just have to provide a conversion webhook. That seems like a much better direction to me. Still, without the conversion webhook, the jobs mostly work, except for a couple of fields.
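To make the conversion-webhook idea concrete, here is a rough, self-contained sketch of the kind of field mapping such a webhook would perform when serving v1 objects backed by v2 storage. The `MPIJobV1`/`MPIJobV2` structs, their fields, and the assumed move of the clean-pod policy under a `RunPolicy` block are illustrative placeholders, not the real Kubeflow schemas; in a real operator this logic would be wired into a CRD conversion webhook (for example via controller-runtime's hub-and-spoke `conversion.Convertible` interface).

```go
package main

import "fmt"

// MPIJobV1 stands in for the current v1 API (trimmed placeholder fields).
type MPIJobV1 struct {
	Name           string
	SlotsPerWorker int32
	CleanPodPolicy string
}

// MPIJobV2 stands in for the new v2 API, assuming run-level settings sit
// under a RunPolicy block (again, placeholder fields only).
type MPIJobV2 struct {
	Name           string
	SlotsPerWorker int32
	RunPolicy      struct {
		CleanPodPolicy string
	}
}

// convertV1ToV2 maps a v1 object onto v2; this is the kind of translation a
// conversion webhook performs between served and storage versions.
func convertV1ToV2(in MPIJobV1) MPIJobV2 {
	var out MPIJobV2
	out.Name = in.Name
	out.SlotsPerWorker = in.SlotsPerWorker
	// Assumed field move: the clean-pod policy relocated under RunPolicy in v2.
	out.RunPolicy.CleanPodPolicy = in.CleanPodPolicy
	return out
}

func main() {
	old := MPIJobV1{Name: "tensorflow-mnist", SlotsPerWorker: 1, CleanPodPolicy: "Running"}
	converted := convertV1ToV2(old)
	fmt.Printf("converted %q: cleanPodPolicy=%s\n", converted.Name, converted.RunPolicy.CleanPodPolicy)
}
```

As noted above, a webhook cannot conjure up missing runtime resources (such as the workers Service) for jobs created by the v1 controller; it only translates the stored API objects.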
Depending on the date for the kubeflow 1.5 release, we can have this plan:
FYI, I added this to the Oct 6th AutoML and Training WG meeting agenda to discuss. cc @andreyvelich
Thank you @terrytangyuan.
Thanks, I'll be there.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
We've migrated PyTorch, MXNet (BytePS), and XGBoost (LightGBM). The next step is to consider moving the MPI APIs and controllers to this repo. (For people who are not familiar with the background, check https://docs.google.com/document/d/1x1JPDQfDMIbnoQRftDH1IzGU0qvHGSU4W6Jl4rJLPhI/edit#heading=h.e33ufidnl8z6)
The reason we didn't include it in the first draft is that the code structure of mpi-operator is a little different from the rest of the operators.
However, we do see a strong need to adopt it in the universal operator so users can easily leverage it to run Horovod and Horovod Elastic jobs. Another big motivation is that DGL support can be built on top of mpi-operator. This would be a big gain for graph users.
/cc @kubeflow/wg-training-leads @ryantd @carmark