
Proposal: Merge operators into one controller manager #103

Closed
gaocegege opened this issue Dec 4, 2020 · 24 comments

@gaocegege
Member

We have many operators in the community, which causes huge maintenance costs. I think we should investigate whether we can merge these operators into one process.

/cc @kubeflow/wg-training-leads

WDYT

@ChanYiLin
Member

SGTM!
One piece of feedback I heard from users is exactly this case:
"Before you run any job, there are already tons of pods running in the k8s cluster."

We have tried to merge multiple controllers into one process before. Using kubebuilder's SetupWithManager, one can spawn multiple controllers in one pod while still implementing each controller separately, as in the sketch below.
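For illustration, here is a minimal controller-runtime sketch of that pattern. The reconciler types and the resources they watch are placeholders, not the actual tf-operator/pytorch-operator code; in the real operators each reconciler would watch its own CRD (TFJob, PyTorchJob, ...):

```go
package main

import (
	"context"
	"os"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Placeholder reconcilers: each framework keeps its own Reconcile logic.
type tfJobReconciler struct{ client.Client }

func (r *tfJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// TFJob-specific reconciliation would go here.
	return ctrl.Result{}, nil
}

func (r *tfJobReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).For(&corev1.Pod{}).Complete(r)
}

type pyTorchJobReconciler struct{ client.Client }

func (r *pyTorchJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// PyTorchJob-specific reconciliation would go here.
	return ctrl.Result{}, nil
}

func (r *pyTorchJobReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).For(&corev1.Service{}).Complete(r)
}

func main() {
	// One manager: all controllers share a single process, cache, and client.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		os.Exit(1)
	}
	if err := (&tfJobReconciler{mgr.GetClient()}).SetupWithManager(mgr); err != nil {
		os.Exit(1)
	}
	if err := (&pyTorchJobReconciler{mgr.GetClient()}).SetupWithManager(mgr); err != nil {
		os.Exit(1)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```

The point of the pattern is that both controllers share one process, one informer cache, and one client, while their reconciliation logic stays independent.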

@gaocegege
Member Author

Yeah, we also have an internal implementation for it, and Alibaba has a similar design. I think we can try to merge it back into the community.

@gaocegege
Member Author

/cc @jian-he

@terrytangyuan
Member

This is mainly about merging the controllers into one controller manager, and we would still keep the code bases for the different operators separate, right?

@gaocegege
Member Author

Personally, I think it would be better to keep all training operators in one repository. WDYT?

@terrytangyuan
Member

terrytangyuan commented Dec 15, 2020

@gaocegege Personally I think it would be great if the code bases were closer together, but it might not be easy to maintain releases and versioning. Also, things like stargazers/watchers, commit history, and useful design discussions will be spread out if we start a new repo and develop from scratch.

@Jeffwan
Member

Jeffwan commented Jan 25, 2021

One repo -> one manager -> multiple controllers -> one CRD reconciler loop each.

The manager can have many controllers, and each controller should take care of only one CRD. Frameworks should be configured to

@gaocegege What's the plan on your side? We have some internal work going on as well and are considering bringing it into Kubeflow.

@gaocegege
Member Author

@Jeffwan We also have some internal work on it; maybe we can discuss it.

@Jeffwan
Member

Jeffwan commented Mar 25, 2021

I raised this request in the 3/24 AutoML and Training meeting. I will draft a proposal for deeper discussion in the following month. We can either discuss it offline or talk about it in the next community meeting.

@zw0610
Member

zw0610 commented Apr 8, 2021

I second @terrytangyuan's concerns about merging all operators into one repo. While keeping one manager with multiple controllers does bring many benefits, like saving traffic, if the controllers work independently, the one-manager-multiple-controller design is perpendicular to the idea of sharing the job reconciliation functions across all operators to save development cost. That includes features like elastic training, error handling, event recording, etc.

Instead, maybe we can move kubeflow/common one step forward by creating an operator-sdk-equivalent executable for this common library, making it something like kubeflow-operator-sdk. The executable would generate more training-operator-specific functions like ReconcilePods and ReconcileServices, in addition to the code that operator-sdk generates, so developers would only need to fill in these functions for customization.
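To make that concrete, here is a rough sketch of the kind of skeleton such a generator might emit. The names and signatures are illustrative only and do not reflect the actual kubeflow/common or operator-sdk APIs:

```go
// Hypothetical skeleton from a "kubeflow-operator-sdk"-style generator.
// The generic job reconciliation loop would be generated and shared; a developer
// fills in only the framework-specific hooks below.
package myjoboperator

import (
	"context"

	corev1 "k8s.io/api/core/v1"
)

// MyJobReconciler would be the generated controller for a new framework's CRD.
type MyJobReconciler struct{}

// ReconcilePods holds the framework-specific pod logic, e.g. how PS/Worker/Chief
// replicas are laid out for TensorFlow. The signature is illustrative.
func (r *MyJobReconciler) ReconcilePods(ctx context.Context, pods []*corev1.Pod) error {
	// To be filled in by the operator developer.
	return nil
}

// ReconcileServices holds the framework-specific service logic, e.g. one headless
// service per worker replica. The signature is illustrative.
func (r *MyJobReconciler) ReconcileServices(ctx context.Context, services []*corev1.Service) error {
	// To be filled in by the operator developer.
	return nil
}
```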


A summary from most operators:

| Operator | Roles | Native* resources | Others |
| --- | --- | --- | --- |
| tf-operator | PS, Worker, Chief, Evaluator | service, pod | (podgroup) |
| mpi-operator | Launcher, Worker | service, pod, configMap, serviceAccount, role, roleBinding | (podgroup) |
| pytorch-operator | Master, Worker | service, pod | (podgroup) |
| xgboost-operator | Master, Worker | service, pod | (podgroup) |
| mxnet-operator | Scheduler, Server, Worker, TunerTracker, TunerServer, Tuner | service, pod | (podgroup) |

*While podgroup is not a Kubernetes-native resource, we consider it one here since it is defined outside the scope of the xxx-operators.

If mpi-operator is able to get rid of its reliance on configMap, serviceAccount, role, and roleBinding, maybe we can apply kubeflow/common to all of the operators.

@zw0610
Member

zw0610 commented Apr 8, 2021

But here I have another concern. The current design of ReconcilePods and ReconcileServices is quite puzzling, due to the lack of real OOP in Golang. Moreover, this design still prevents developers from sharing the elastic feature among operators. For example, when the worker replicas change, the pod creation/deletion logic still needs to be implemented individually in each operator. However, the reason why we need ReconcilePods is more about differences in the podTemplate, like composing TF_CONFIG.

Is it possible to use a shared ReconcilePods method, but allow developers to register decorators for the podTemplate before it is sent to PodControl.CreatePodsWithControllerRef?

But sure, developers should still be able to 'override' the ReconcilePods method if they want.
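To make the decorator idea slightly more concrete, here is a rough sketch. RegisterPodTemplateDecorator, the registry, and the call site are hypothetical names, not existing kubeflow/common APIs:

```go
package common

import corev1 "k8s.io/api/core/v1"

// PodTemplateDecorator mutates the pod template of one replica just before the
// shared ReconcilePods hands it to PodControl.CreatePodsWithControllerRef.
type PodTemplateDecorator func(tmpl *corev1.PodTemplateSpec, replicaType string, index int)

// decorators is a hypothetical registry of framework-specific customizations.
var decorators []PodTemplateDecorator

// RegisterPodTemplateDecorator lets a framework controller plug in its own
// customization (e.g. composing TF_CONFIG) without overriding ReconcilePods.
func RegisterPodTemplateDecorator(d PodTemplateDecorator) {
	decorators = append(decorators, d)
}

// applyDecorators would be called by the shared ReconcilePods right before pod
// creation, so the creation/deletion (and elastic scaling) logic stays shared.
func applyDecorators(tmpl *corev1.PodTemplateSpec, replicaType string, index int) {
	for _, d := range decorators {
		d(tmpl, replicaType, index)
	}
}
```

A TensorFlow-style controller could then register a decorator that injects TF_CONFIG into each container, while the pod creation/deletion and elastic scaling logic stays in the shared code.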

@Jeffwan
Member

Jeffwan commented Apr 9, 2021

if the controllers work independently, the one-manager-multiple-controller design is perpendicular to the idea of sharing the job reconciliation functions across all operators to save development cost

I don't quite understand "perpendicular" in this context. Can you elaborate with more details?

Instead, maybe we can move kubeflow/common one step forward by creating an operator-sdk-equivalent executable for this common library, making it something like kubeflow-operator-sdk.

I think this is a good idea. I have two things in mind.

  1. From the implementation perspective, I am not sure if there is a way to leverage existing tools like kubebuilder or operator-sdk and inject the generated code. Currently, people still need to use the low-level API, or use kubebuilder to generate a skeleton and then use kubeflow/common. A kubeflow-operator-sdk would further reduce the time to build a new DL operator from scratch.
  2. The number of frameworks is growing but is still reasonable. Making a tool for a small group may not be worth the effort. If the tool can be made flexible enough that the all-in-one operator can use it as well, then it may be worth the effort.

Moreover, this design still prevents developers from sharing the elastic feature among operators. For example, when the worker replicas change, the pod creation/deletion logic still needs to be implemented individually in each operator.
Is it possible to use a shared ReconcilePods method, but allow developers to register decorators for the podTemplate before it is sent to PodControl.CreatePodsWithControllerRef?

Em, I think we should collect these good use cases and probably try to find a different way to abstract the methods. I agree that there is no guarantee we can make it compatible with future use cases, but we can still leave enough flexibility, like overriding ReconcilePods.

@zw0610
Member

zw0610 commented Apr 9, 2021

if the controllers work independently, the one-manager-multiple-controller design is perpendicular to the idea of sharing the job reconciliation functions across all operators to save development cost

Hi @Jeffwan, what I mean by the comment above is that whether we use one-manager-multiple-controllers or controllers working independently is a question on a different dimension from the idea of sharing the code for how we reconcile jobs. With or without such a design, we can help developers working on operators for new frameworks through a larger shared code base.

But as you mentioned, the number of new frameworks looks limited, and I agree.

@Jeffwan
Member

Jeffwan commented May 3, 2021

Some updates here: @zw0610 and I wrote a doc, and we still need some time to polish it and make it public. Then we can have some discussion in the WG meetings.

@Jeffwan
Member

Jeffwan commented May 19, 2021

Here's the proposal: All-in-one Training Operator. Any feedback is welcome. I presented it at the 05/19 US & EU friendly meeting, and @zw0610 will present at the 06/02 CN & EU friendly meeting.

@johnugeorge
Member

A few thoughts:

  1. Are all operators using the same amount of resources? Say, in one deployment, PyTorch jobs are spawned more than TF jobs, while in another, MXNet jobs are used more. How do we control/recommend resource usage from a single combined operator's point of view?
  2. Since we will lose isolation between operator deployments, will releases become difficult? Also, what if a CR upgrade is required for one operator while there are running jobs for the other operators?

@gaocegege Do you have this design already?

@gaocegege
Member Author

@zw0610 and @Jeffwan are working on the design; I think we can discuss it in the community call.

@terrytangyuan
Member

I also just left some specific comments in the doc. Please take a look when you get a chance.

@Jeffwan
Member

Jeffwan commented May 21, 2021

A few thoughts:

  1. Are all operators using the same amount of resources? Say, in one deployment, PyTorch jobs are spawned more than TF jobs, while in another, MXNet jobs are used more. How do we control/recommend resource usage from a single combined operator's point of view?

Controllers will use different amounts of resources. Instead of understanding each controller's utilization, I would recommend adjusting resources based on the total number of jobs. Users who transition from multiple operators can sum up the requests/limits and use that for the new operator. (The all-in-one operator uses less resources because of the shared cache.) However, this won't be accurate; it would be good for us to do some load testing and give recommended numbers.

  2. Since we will lose isolation between operator deployments, will releases become difficult? Also, what if a CR upgrade is required for one operator while there are running jobs for the other operators?

Good point. Do you have concerns about the job state? Is anything different from upgrading an operator today?

@johnugeorge
Member

When any CR is upgraded, we will need an operator upgrade and a controller restart. Will upgrades be difficult for the users?

@Jeffwan
Member

Jeffwan commented Jun 27, 2021

When any CR is upgraded, we will need an operator upgrade and a controller restart. Will upgrades be difficult for the users?

I forgot to respond to that. I think the upgrade behavior is similar to the existing controllers. The only "overhead" is that operator owners may need more time for coordination, since all users are on one operator.

@gaocegege
Member Author

Should we close this issue?

@Jeffwan
Member

Jeffwan commented Aug 31, 2021

/close

@google-oss-robot
Copy link

@Jeffwan: Closing this issue.

In response to this:

/close

