
Proposal: Merge operators into one controller manager #103

Closed
gaocegege opened this issue Dec 4, 2020 · 24 comments

@gaocegege
Member

We have many operators in the community, which causes huge maintenance costs. I think we should investigate whether we can merge these operators into one process.

/cc @kubeflow/wg-training-leads

WDYT

@ChanYiLin
Member

SGTM!
One piece of feedback I heard from users is exactly this case:
"Before you run any job, there are already tons of pods running in the k8s cluster."

We have tried to merge multiple controllers into one process before. Using kubebuilder's SetupWithManager, one can spawn multiple controllers in one pod while still implementing each controller separately, as in the sketch below.
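For illustration, here is a minimal controller-runtime sketch of that pattern. The reconciler types and the resources they watch are placeholders, not the actual tf-operator/pytorch-operator code; in the real operators each reconciler would watch its own CRD (TFJob, PyTorchJob, ...):

```go
package main

import (
	"context"
	"os"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Placeholder reconcilers: each framework keeps its own Reconcile logic.
type tfJobReconciler struct{ client.Client }

func (r *tfJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// TFJob-specific reconciliation would go here.
	return ctrl.Result{}, nil
}

func (r *tfJobReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).For(&corev1.Pod{}).Complete(r)
}

type pyTorchJobReconciler struct{ client.Client }

func (r *pyTorchJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// PyTorchJob-specific reconciliation would go here.
	return ctrl.Result{}, nil
}

func (r *pyTorchJobReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).For(&corev1.Service{}).Complete(r)
}

func main() {
	// One manager: all controllers share a single process, cache, and client.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		os.Exit(1)
	}
	if err := (&tfJobReconciler{mgr.GetClient()}).SetupWithManager(mgr); err != nil {
		os.Exit(1)
	}
	if err := (&pyTorchJobReconciler{mgr.GetClient()}).SetupWithManager(mgr); err != nil {
		os.Exit(1)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```

The point of the pattern is that both controllers share one process, one informer cache, and one client, while their reconciliation logic stays independent.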

@gaocegege
Member Author

Yeah, we also have an internal implementation for it, and Alibaba has a similar design. I think we can try to merge it back into the community.

@gaocegege
Member Author

/cc @jian-he

@terrytangyuan
Member

This is mainly about merging the controllers into one controller manager, and we would still keep the code bases for the different operators separate, right?

@gaocegege
Member Author

Personally, I think it would be better to keep all training operators in one repository. WDYT?

@terrytangyuan
Member

terrytangyuan commented Dec 15, 2020

@gaocegege Personally I think it would be great if the code bases were closer together, but it might not be easy to maintain releases and versioning. Also, things like stargazers/watchers, commit history, and useful design discussions will be spread out if we start a new repo and develop from scratch.

@Jeffwan
Member

Jeffwan commented Jan 25, 2021

One repo -> one manager -> multiple controllers -> one CRD reconciler loop each.

The manager can have many controllers, and each controller should take care of only one CRD. Frameworks should be configured to

@gaocegege What's the plan on your side? We have some internal work going on as well and are considering bringing it into Kubeflow.

@gaocegege
Member Author

@Jeffwan We also have some internal work on it; maybe we can discuss it.

@Jeffwan
Member

Jeffwan commented Mar 25, 2021

I raised this request in the 3/24 AutoML and Training meeting. I will draft a proposal for deeper discussion in the following month. We can either discuss it offline or talk about it in the next community meeting.

@zw0610
Member

zw0610 commented Apr 8, 2021

I second @terrytangyuan's concerns about merging all operators into one repo. While keeping one manager with multiple controllers does bring many benefits, like saving traffic, if the controllers work independently, the one-manager-multiple-controller design is perpendicular to the idea of sharing the job reconciliation functions across all operators to save development cost. That includes features like elastic training, error handling, event recording, etc.

Instead, maybe we can move kubeflow/common one step forward by creating an operator-sdk-equivalent executable for this common library, making it something like kubeflow-operator-sdk. The executable would generate more training-operator-specific functions like ReconcilePods and ReconcileServices, in addition to the code that operator-sdk generates, so developers would only need to fill in these functions for customization.
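To make that concrete, here is a rough sketch of the kind of skeleton such a generator might emit. The names and signatures are illustrative only and do not reflect the actual kubeflow/common or operator-sdk APIs:

```go
// Hypothetical skeleton from a "kubeflow-operator-sdk"-style generator.
// The generic job reconciliation loop would be generated and shared; a developer
// fills in only the framework-specific hooks below.
package myjoboperator

import (
	"context"

	corev1 "k8s.io/api/core/v1"
)

// MyJobReconciler would be the generated controller for a new framework's CRD.
type MyJobReconciler struct{}

// ReconcilePods holds the framework-specific pod logic, e.g. how PS/Worker/Chief
// replicas are laid out for TensorFlow. The signature is illustrative.
func (r *MyJobReconciler) ReconcilePods(ctx context.Context, pods []*corev1.Pod) error {
	// To be filled in by the operator developer.
	return nil
}

// ReconcileServices holds the framework-specific service logic, e.g. one headless
// service per worker replica. The signature is illustrative.
func (r *MyJobReconciler) ReconcileServices(ctx context.Context, services []*corev1.Service) error {
	// To be filled in by the operator developer.
	return nil
}
```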


A summary from most operators:

| Operator | Roles | Native* resources | Others |
| --- | --- | --- | --- |
| tf-operator | PS, Worker, Chief, Evaluator | service, pod | (podgroup) |
| mpi-operator | Launcher, Worker | service, pod, configMap, serviceAccount, role, roleBinding | (podgroup) |
| pytorch-operator | Master, Worker | service, pod | (podgroup) |
| xgboost-operator | Master, Worker | service, pod | (podgroup) |
| mxnet-operator | Scheduler, Server, Worker, TunerTracker, TunerServer, Tuner | service, pod | (podgroup) |

*While podgroup is not a Kubernetes-native resource, we consider it one here since it is defined outside the scope of the xxx-operators.

If mpi-operator is able to get rid of its reliance on configMap, serviceAccount, role, and roleBinding, maybe we can apply kubeflow/common to all of the operators.

@zw0610
Member

zw0610 commented Apr 8, 2021

But here I have another concern. The current design of ReconcilePods and ReconcileServices is quite puzzling, due to the lack of real OOP in Golang. Moreover, this design still prevents developers from sharing the elastic feature among operators. For example, when the worker replicas change, the pod creation/deletion logic still needs to be implemented individually in each operator. However, the reason why we need ReconcilePods is more about differences in the podTemplate, like composing TF_CONFIG.

Is it possible to use a shared ReconcilePods method, but allow developers to register decorators for the podTemplate before it is sent to PodControl.CreatePodsWithControllerRef?

But sure, developers should still be able to 'override' the ReconcilePods method if they want.
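To make the decorator idea slightly more concrete, here is a rough sketch. RegisterPodTemplateDecorator, the registry, and the call site are hypothetical names, not existing kubeflow/common APIs:

```go
package common

import corev1 "k8s.io/api/core/v1"

// PodTemplateDecorator mutates the pod template of one replica just before the
// shared ReconcilePods hands it to PodControl.CreatePodsWithControllerRef.
type PodTemplateDecorator func(tmpl *corev1.PodTemplateSpec, replicaType string, index int)

// decorators is a hypothetical registry of framework-specific customizations.
var decorators []PodTemplateDecorator

// RegisterPodTemplateDecorator lets a framework controller plug in its own
// customization (e.g. composing TF_CONFIG) without overriding ReconcilePods.
func RegisterPodTemplateDecorator(d PodTemplateDecorator) {
	decorators = append(decorators, d)
}

// applyDecorators would be called by the shared ReconcilePods right before pod
// creation, so the creation/deletion (and elastic scaling) logic stays shared.
func applyDecorators(tmpl *corev1.PodTemplateSpec, replicaType string, index int) {
	for _, d := range decorators {
		d(tmpl, replicaType, index)
	}
}
```

A TensorFlow-style controller could then register a decorator that injects TF_CONFIG into each container, while the pod creation/deletion and elastic scaling logic stays in the shared code.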

@Jeffwan
Member

Jeffwan commented Apr 9, 2021

if the controllers work independently, the one-manager-multiple-controller design is perpendicular to the idea of sharing the job reconciliation functions across all operators to save development cost

I don't quite understand "perpendicular" in this context. Can you elaborate with more details?

Instead, maybe we can move kubeflow/common one step forward by creating an operator-sdk-equivalent executable for this common library, making it something like kubeflow-operator-sdk.

I think this is a good idea. I have two things in mind.

  1. From the implementation perspective, I am not sure if there is a way to leverage existing tools like kubebuilder or operator-sdk and inject the generated code. Currently, people still need to use the low-level API, or use kubebuilder to generate a skeleton and then use kubeflow/common. A kubeflow-operator-sdk would further reduce the time to build a new DL operator from scratch.
  2. The number of frameworks is growing but is still reasonable. Making a tool for a small group may not be worth the effort. If the tool can be made flexible enough that the all-in-one operator can use it as well, then it may be worth the effort.

Moreover, this design still prevents developers from sharing the elastic feature among operators. For example, when the worker replicas change, the pod creation/deletion logic still needs to be implemented individually in each operator.
Is it possible to use a shared ReconcilePods method, but allow developers to register decorators for the podTemplate before it is sent to PodControl.CreatePodsWithControllerRef?

Em, I think we should collect these good use cases and probably try to find a different way to abstract the methods. I agree that there is no guarantee we can make it compatible with future use cases, but we can still leave enough flexibility, like overriding ReconcilePods.

@zw0610
Member

zw0610 commented Apr 9, 2021

if the controllers work independently, the one-manager-multiple-controller design is perpendicular to the idea of sharing the job reconciliation functions across all operators to save development cost

Hi @Jeffwan, what I mean by the comment above is that whether we use one-manager-multiple-controllers or controllers working independently is a question on a different dimension from the idea of sharing the code for how we reconcile jobs. With or without such a design, we can help developers working on operators for new frameworks through a larger shared code base.

But as you mentioned, the number of new frameworks looks limited, and I agree.

@Jeffwan
Member

Jeffwan commented May 3, 2021

Some updates here: @zw0610 and I wrote a doc, and we still need some time to polish it and make it public. Then we can have some discussion in the WG meetings.

@Jeffwan
Member

Jeffwan commented May 19, 2021

Here's the proposal: All-in-one Training Operator. Any feedback is welcome. I presented it at the 05/19 US & EU friendly meeting, and @zw0610 will present at the 06/02 CN & EU friendly meeting.

@johnugeorge
Member

A few thoughts:

  1. Are all operators using the same amount of resources? Say, in one deployment, PyTorch jobs are spawned more than TF jobs, while in another, MXNet jobs are used more. How do we control/recommend resource usage from a single combined operator's point of view?
  2. Since we will lose isolation between operator deployments, will releases become difficult? Also, what if a CR upgrade is required for one operator while there are running jobs for the other operators?

@gaocegege Do you have this design already?

@gaocegege
Member Author

@zw0610 and @Jeffwan are working on the design; I think we can discuss it in the community call.

@terrytangyuan
Member

I also just left some specific comments in the doc. Please take a look when you get a chance.

@Jeffwan
Member

Jeffwan commented May 21, 2021

A few thoughts:

  1. Are all operators using the same amount of resources? Say, in one deployment, PyTorch jobs are spawned more than TF jobs, while in another, MXNet jobs are used more. How do we control/recommend resource usage from a single combined operator's point of view?

Controllers will use different amounts of resources. Instead of understanding each controller's utilization, I would recommend adjusting resources based on the total number of jobs. Users who transition from multiple operators can sum up the requests/limits and use that for the new operator. (The all-in-one operator uses less resources because of the shared cache.) However, this won't be accurate; it would be good for us to do some load testing and give recommended numbers.

  2. Since we will lose isolation between operator deployments, will releases become difficult? Also, what if a CR upgrade is required for one operator while there are running jobs for the other operators?

Good point. Do you have concerns about the job state? Is anything different from upgrading an operator today?

@johnugeorge
Member

When any CR is upgraded, we will need an operator upgrade and a controller restart. Will upgrades be difficult for the users?

@Jeffwan
Member

Jeffwan commented Jun 27, 2021

When any CR is upgraded, we will need an operator upgrade and a controller restart. Will upgrades be difficult for the users?

I forgot to respond to that. I think the upgrade behavior is similar to the existing controllers. The only "overhead" is that operator owners may need more time for coordination, since all users are on one operator.

@gaocegege
Member Author

Should we close this issue?

@Jeffwan
Member

Jeffwan commented Aug 31, 2021

/close

@google-oss-robot
Copy link

@Jeffwan: Closing this issue.

In response to this:

/close

