Proposal: Merge operators into one controller manager #103
SGTM! We have tried to merge multiple controllers into one process before, using kubebuilder.
Yeah, we also have an internal implementation for it, and Alibaba has a similar design. I think we can try to merge it back to the community.
/cc @jian-he
This is mainly about merging the controllers into one controller manager while we still keep the code bases for the different operators separate, right?
Personally, I think it would be better to keep all training operators in one repository. WDYT?
@gaocegege Personally, I think it would be great if the code bases were closer together, but it might not be easy to maintain releases and versioning. Also, things like stargazers/watchers, commit history, and useful design discussions will be spread out if we start a new repo and develop from scratch.
One repo -> one manager -> multiple controllers -> CRD reconciler loops. The manager can have many controllers, and each controller should only take care of one CRD. Frameworks should be configurable. @gaocegege What's the plan on your side? We have some internal work going on as well and are considering contributing it to Kubeflow.
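The one-repo / one-manager / multiple-controllers layout can be sketched in plain Go. This is a minimal illustration of the dispatch pattern only; the names (`Reconciler`, `Manager`, `tfJobReconciler`) are hypothetical, and a real operator would build on controller-runtime's manager rather than hand-rolling this.

```go
package main

import "fmt"

// Reconciler reconciles a single CRD kind, mirroring the
// one-controller-per-CRD rule described above.
type Reconciler interface {
	Kind() string
	Reconcile(name string) error
}

// Manager owns many controllers, one per CRD kind, all in one process.
type Manager struct {
	reconcilers map[string]Reconciler
}

func NewManager() *Manager {
	return &Manager{reconcilers: map[string]Reconciler{}}
}

// Register wires a controller into the shared manager; in a real
// operator, which frameworks are enabled would come from configuration.
func (m *Manager) Register(r Reconciler) {
	m.reconcilers[r.Kind()] = r
}

// Dispatch routes an event for a CRD kind to the controller that owns it.
func (m *Manager) Dispatch(kind, name string) error {
	r, ok := m.reconcilers[kind]
	if !ok {
		return fmt.Errorf("no controller registered for kind %s", kind)
	}
	return r.Reconcile(name)
}

// tfJobReconciler is a toy stand-in for a framework-specific controller.
type tfJobReconciler struct{}

func (tfJobReconciler) Kind() string { return "TFJob" }
func (tfJobReconciler) Reconcile(name string) error {
	fmt.Printf("reconciling TFJob %s\n", name)
	return nil
}

func main() {
	m := NewManager()
	m.Register(tfJobReconciler{})
	_ = m.Dispatch("TFJob", "mnist")
}
```

The point of the sketch: controllers stay independent (each sees only its own kind), while the process, cache, and configuration are shared by the single manager.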
@Jeffwan We also have some internal work on it; maybe we can discuss it.
I raised this request in the 3/24 AutoML and Training meeting. I will draft a proposal for deeper discussion in the following month. We can either discuss it offline or talk about it in the next community meeting.
I second @terrytangyuan's suggestion not to merge all operators into one repo. While keeping one manager with multiple controllers does bring benefits like saving traffic, if the controllers work independently, the one-manager-multiple-controllers design is orthogonal to the idea of sharing the job reconciliation function across all operators to save development cost. That includes features like elastic training, error handling, event recording, etc. Instead, maybe we can move kubeflow/common one step forward. A summary from most operators:
*While PodGroup is not a Kubernetes-native resource, here we consider it as one because it is defined outside the xxx-operator scope. If mpi-operator can get rid of its reliance on ConfigMap, ServiceAccount, Role, and RoleBinding, maybe we can apply common to all of the operators.
But here I have another concern about the contemporary design. Is it possible to use the shared reconciliation code while developers are still able to 'override' specific behaviors?
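One hedged way to read the 'override' question: the shared library keeps the reconciliation skeleton and exposes framework-specific steps as an interface that each operator can embed and selectively override. The hook names below (`SetClusterSpec`, `GetDefaultContainerName`) are illustrative assumptions, not the actual kubeflow/common API.

```go
package main

import "fmt"

// FrameworkHooks is a hypothetical extension point: the shared
// reconciler calls these, and each framework operator may override them.
type FrameworkHooks interface {
	SetClusterSpec(job string) string
	GetDefaultContainerName() string
}

// DefaultHooks is the shared behavior every framework inherits.
type DefaultHooks struct{}

func (DefaultHooks) SetClusterSpec(job string) string { return "generic-spec:" + job }
func (DefaultHooks) GetDefaultContainerName() string  { return "main" }

// MPIHooks embeds the defaults and overrides only what differs.
type MPIHooks struct{ DefaultHooks }

func (MPIHooks) GetDefaultContainerName() string { return "mpi-launcher" }

// ReconcileJob is the shared reconciliation skeleton: common steps
// live here once, framework-specific steps go through the hooks.
func ReconcileJob(h FrameworkHooks, job string) string {
	spec := h.SetClusterSpec(job)
	return fmt.Sprintf("%s/%s", spec, h.GetDefaultContainerName())
}

func main() {
	fmt.Println(ReconcileJob(DefaultHooks{}, "job-a")) // generic-spec:job-a/main
	fmt.Println(ReconcileJob(MPIHooks{}, "job-b"))     // generic-spec:job-b/mpi-launcher
}
```

One design caveat: Go embedding is not virtual dispatch, so overrides only take effect when the shared code calls through the interface (as `ReconcileJob` does here), not when a default method calls a sibling method on itself.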
I don't quite understand
I think this is a good idea. I have two things in mind.
Em. I think we should collect these good use cases and probably try to find a different way to abstract the methods. While I agree that there's no guarantee we can make it compatible with future use cases, we can still leave enough flexibility.
Hi @Jeffwan, what I mean in the comment above is that whether we use one-mgr-multi-controller or controllers working individually is a question on a different dimension from the idea of sharing the code for how we reconcile jobs. With or without such a design, we can help developers working on operators for new frameworks with a larger shared code base. But as you mentioned, the number of new frameworks looks limited, which I agree with.
Some updates here. @zw0610 and I wrote a doc; we still need some time to polish it and make it public. Then we can have some discussion in the WG meetings.
Here's the proposal: All-in-one training operator. Any feedback is welcome. I will present it in the 05/19 US & EU friendly meeting, and @zw0610 will present it in the 06/02 CN & EU friendly meeting.
A few thoughts:
@gaocegege Do you already have this design?
I also just left some specific comments in the doc. Please take a look when you get a chance.
Controllers will use different amounts of resources. Instead of trying to understand each controller's utilization, I would recommend adjusting resources based on the total number of jobs. Users who transition from multiple operators can sum up the requests/limits and use that for the new operator. (The all-in-one operator uses less than that because of the shared cache.) However, this won't be accurate, so it would be good for us to do some load testing and give some recommended numbers.
Good point. Do you have concerns about the job state? Is anything different from upgrading an operator today?
When any CR is upgraded, we will need an operator upgrade and a controller restart. Will the upgrade be difficult for users?
I forgot to respond to that. I think the upgrade behavior is similar to the existing controllers. The only "overhead" is that the operator owners may need more time for coordination, since all users share one operator.
Should we close this issue?
/close
@Jeffwan: Closing this issue.
We have many operators in the community, and they cause a huge maintenance cost. I think we should investigate whether we can merge these operators into one process.
/cc @kubeflow/wg-training-leads
WDYT