
Add distributed training operator roadmap #61

Merged
2 commits merged into kubeflow:master from the roadmap branch on Apr 12, 2020

Conversation

Jeffwan
Member

@Jeffwan Jeffwan commented Apr 6, 2020

Let's have more discussion here on the training operator 2020 roadmap.

I think most of the operators are pretty stable, and the major work is to graduate the remaining operators to v1 and rewrite the operators to use the common APIs.

Besides that, we did see some feature requests go unfulfilled last year, so let's revisit them.
In addition, let's think big about features/products that move toward a production-grade level.

@terrytangyuan @gaocegege @johnugeorge @rongou @richardsliu @jian-he @merlintang @suleisl2000 @jlewi

@kubeflow-bot

This change is Reviewable

ROADMAP.md Outdated

* Enhance maintainability of operator common module https://github.com/kubeflow/common/issues/54
* Migrate operators to use kubeflow/common apis
* Graduate MPI Operator, Mxnet Operator and XGBoost Operator to v1
Member

YES! We probably need to create concrete plans for each operator (probably in individual repos).

ROADMAP.md Outdated

## Features

To take advantage of other capabilities of job scheduler components, operators will expose more APIs for advanced scheduling. More features will be added to simplify usage like dynamic volume supports and git ops experiences. In order to make it easily used in the Kubeflow ecosystem, we can add more launcher KFP components for adoption.
Member

Could you elaborate on "launcher KFP components for adoption"?

Member Author

From our customers, we see that users have different options for the training component in KFP. We have cloud components like SageMaker, and lots of users want to use the Kubeflow-based training operators as well. In order to use TF/MPI smoothly, it would be better to write a common launcher KFP component like https://github.com/kubeflow/pipelines/tree/master/components/kubeflow/katib-launcher

Users would just need to pass the count of PS and workers, the container image, arguments, etc. for the kind of job, plus output artifacts. This is much easier compared to writing a YAML and using ResourceOp.

My point is that we need a high-level abstraction of the job to increase operator adoption in the KF ecosystem.
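
For illustration only, here is a minimal sketch of how a pipeline could use such a common launcher with the KFP v1 SDK. The component URL, its input names, and the image are placeholders for the proposed launcher, not an existing component; only `kfp.components.load_component_from_url`, `dsl.pipeline`, and the compiler call are real SDK pieces.

```python
# Hypothetical sketch of the proposed common launcher component (KFP v1 SDK).
# The component URL and its input names are made up for illustration.
import kfp
import kfp.dsl as dsl
from kfp import components

# Assumption: a reusable TFJob launcher component is published somewhere,
# similar in spirit to the existing katib-launcher component.
tfjob_launcher_op = components.load_component_from_url(
    "https://example.com/components/tfjob-launcher/component.yaml"  # placeholder
)

@dsl.pipeline(name="distributed-training", description="Launch a TFJob from KFP")
def train_pipeline(worker_count: int = 2, ps_count: int = 1):
    # Pass only high-level job parameters instead of hand-writing a TFJob YAML
    # and submitting it through dsl.ResourceOp.
    tfjob_launcher_op(
        name="mnist-tfjob",                               # placeholder job name
        worker_replicas=worker_count,
        ps_replicas=ps_count,
        image="gcr.io/example/mnist:latest",              # placeholder image
        command="python /opt/mnist/train.py --epochs 10",
    )

if __name__ == "__main__":
    kfp.compiler.Compiler().compile(train_pipeline, "train_pipeline.yaml")
```

The point of the abstraction is that the launcher component would render the TFJob manifest, submit it, and wait for completion, so the data scientist never touches the YAML or a ResourceOp.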

Member

Does the TF launcher https://github.com/kubeflow/pipelines/tree/master/components/kubeflow/launcher align with what you describe here?
Then what we need to do next is to introduce launchers for the other jobs or provide a common launcher.

Member Author

Yes. We plan to have a common launcher. Using an operator on its own makes it hard to visualize artifacts and model metrics. We have to use KFP in this case, or create a separate component to integrate these features for training jobs.

ROADMAP.md Outdated
To take advantage of other capabilities of job scheduler components, operators will expose more APIs for advanced scheduling. More features will be added to simplify usage like dynamic volume supports and git ops experiences. In order to make it easily used in the Kubeflow ecosystem, we can add more launcher KFP components for adoption.

* Support dynamic volume provisioning for distributed training jobs https://github.com/kubeflow/common/issues/19
* MLOPS - Allow user to submit jobs using Git repo without building container images.
Member

MLOPS -> MLOps

Member

Could you elaborate? What's the suggested approach?

Member Author

I am not sure if that's a blocker. Based on our customers' feedback, data scientists don't like to build containers; the friendliest approach for them is to drop a Git repo. It could be a commit or a branch.

Trying to see if others have similar feedback or feelings.

Another thing is dataset versioning in MLOps. I am considering having a separate component to version datasets and link it into the training operator. But this is just a pretty early thought.

Member

@terrytangyuan terrytangyuan Apr 10, 2020

Agreed. Though I think git -> image is a separate use case that we don't want to focus on here. There are probably existing solutions already.

Regarding data versioning, it's not an easy thing to do, though it's frequently requested by many users. Pachyderm is designed to make that easier. Perhaps we can work with the Pachyderm community on that.

Member Author

@Jeffwan Jeffwan Apr 10, 2020

Yeah, agreed. For example, the Fairing cloud builder uses Kaniko to build Docker images from a notebook; that's something that already exists. On the data versioning side, there are a few solutions like Pachyderm and DVC that aim to solve the problem; we can consider leveraging these existing solutions to make them work in the Kubeflow ecosystem.

I think the main challenge today is that MLOps has a wide definition. Users' requirements include versioning code and data, reducing the time and difficulty to push models into production, etc.

A reasonable solution will be something that suits the current architecture. We definitely need some product definition for it. I can collect some feedback and draft some features later.

ROADMAP.md Outdated
## Monitoring

* Provides a better common logger https://github.com/kubeflow/common/issues/60
* Expose generic prometheus metrics in common operators https://github.com/kubeflow/common/issues/22

ROADMAP.md Outdated

* Provides a better common logger https://github.com/kubeflow/common/issues/60
* Expose generic prometheus metrics in common operators https://github.com/kubeflow/common/issues/22
* Centralized Job Dashboard for training jobs (Add metadata graph, model artifacts later)
Member

This seems to overlap with some existing projects?

Member Author

KF had a TF dashboard before, but it was deprecated. Are you talking about that one?

Member

There was a discussion earlier regarding a job dashboard. It was deprecated because users didn't find the dashboard very useful, and it is not possible to list all operator spec options in the UI.

Member

One possibility could be a common dashboard across operators?

Member Author

Yeah, I think that would be helpful. When there are enough features in that common dashboard, users will like it.

I think a reasonable request will look like the following (a rough sketch of item 1 follows the list):

  1. Check jobs of all CRDs.
  2. Basic metadata of the job -> start/end time, job name, resources.
  3. Actions -> terminating a job.
  4. Metrics -> it would be great to integrate kubeflow/metadata info directly into the job, like model artifacts and logged params. Actually, I think we can extend the metadata SDK to richer graphs like line charts and bar charts. It would be helpful if we had job-level metrics.
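
As a rough sketch of item 1 above, a common dashboard backend could aggregate jobs across the training CRDs through the Kubernetes API. The snippet below uses the official Python client; the exact group/version/plural for each operator depends on its graduation level, so treat the list as an assumption.

```python
# Minimal sketch: list training jobs across several Kubeflow CRDs so a common
# dashboard could show them in one place (uses the official kubernetes client).
from kubernetes import client, config

# CRDs served by the training operators under the kubeflow.org API group.
# Versions are assumptions; adjust per operator (some are not yet v1).
TRAINING_CRDS = [
    ("kubeflow.org", "v1", "tfjobs"),
    ("kubeflow.org", "v1", "pytorchjobs"),
    ("kubeflow.org", "v1", "xgboostjobs"),
]

def list_training_jobs():
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    api = client.CustomObjectsApi()
    jobs = []
    for group, version, plural in TRAINING_CRDS:
        resp = api.list_cluster_custom_object(group, version, plural)
        for item in resp.get("items", []):
            meta = item.get("metadata", {})
            status = item.get("status", {})
            jobs.append({
                "kind": plural,
                "name": meta.get("name"),
                "namespace": meta.get("namespace"),
                "startTime": status.get("startTime"),
                "completionTime": status.get("completionTime"),
            })
    return jobs

if __name__ == "__main__":
    for job in list_training_jobs():
        print(job)
```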

ROADMAP.md Outdated
We will continue developing capabilities for better reliability, scaling, and maintenance of production distributed training experiences provided by operators.

* Enhance maintainability of operator common module https://github.com/kubeflow/common/issues/54
* Migrate operators to use kubeflow/common apis
Member

Do you mean tf and pytorch?

Member Author

TF and PyTorch are already on it, but PyTorch still imports TF libraries. I think MXNet is another one.

Member

Gotcha.

Member

But TF and PyTorch are not using the common implementation yet.

Contributor

Tree-based models like XGBoost or LightGBM would use this. The XGBoost operator is implemented via common.

ROADMAP.md Outdated
* Support dynamic volume provisioning for distributed training jobs https://github.com/kubeflow/common/issues/19
* MLOPS - Allow user to submit jobs using Git repo without building container images.
* Add Job priority and Queue in SchedulingPolicy for advanced scheduling in common operator https://github.com/kubeflow/common/issues/46
* Add pipeline launcher components for different training jobs. https://github.com/kubeflow/pipelines/issues/3445
Member

What can we do from kubeflow/common's side?

Member Author

I didn't explain this clearly. This won't be done in kubeflow/common, but it's in the same area. Either operator contributors or KFP contributors need to fill the gap.

I think we are trying to consider, from a high level, what we can do to make the operators easily and widely used in the KF ecosystem.

Member

SGTM.

@gaocegege
Member

LGTM.

Thanks for your contribution! 🎉 👍

ROADMAP.md Outdated

Continue to optimize reconciler performance and reduce latency to take actions on CR events.

* Performance optimization for 500 concurrent jobs and large scale completed jobs. https://github.com/kubeflow/tf-operator/issues/965 https://github.com/kubeflow/tf-operator/issues/1079
Member

@johnugeorge johnugeorge Apr 7, 2020

Is it planned as part of the kubeflow/common implementation? Kubeflow/Kubebench can also help here.

Member Author

It won't be inside kubeflow/common; I think it would be better to keep using Kubeflow/Kubebench. I'm trying to get more feedback on common requirements across the different operators. We need to leverage existing solutions.

ROADMAP.md Outdated
* MLOps - Allow user to submit jobs using Git repo without building container images.
* Add Job priority and Queue in SchedulingPolicy for advanced scheduling in common operator. Related Issue: [#46](https://github.com/kubeflow/common/issues/46).
* Add pipeline launcher components for different training jobs. [pipeline#3445](https://github.com/kubeflow/pipelines/issues/3445).

Member

I am thinking that there might be many people submitting distributed training jobs directly using the command line and the job operators.
Maybe we can think about how to export the metrics and artifacts to kubeflow/metadata in this case.
Or is currently using the metadata SDK in the training code enough?

Member Author

@ChanYiLin
Users can check metrics or artifacts in a separate UI, but it's kind of isolated. The ideal case is (a rough sketch follows the list):

  1. All job-based metadata, metrics, or artifacts can be aggregated at the job level.
  2. Support rich metadata like bar charts, confusion matrices, ROC curves, etc. The Pipelines project has some, but I notice that most training operator users may not use KFP; we can tell that from the integration between KFP and the operators.
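
For context on the "metadata SDK in the training code" option, here is a very rough sketch using the kubeflow-metadata Python SDK. The workspace, run, and artifact names plus the URIs are made up, the gRPC endpoint assumes the default in-cluster metadata service, and the exact class signatures should be double-checked against the SDK before relying on them.

```python
# Rough sketch: log a model artifact and job-level metrics from training code
# with the kubeflow-metadata SDK. Names, URIs, and the endpoint are assumptions.
from kubeflow.metadata import metadata

store = metadata.Store(grpc_host="metadata-grpc-service.kubeflow", grpc_port=8080)
ws = metadata.Workspace(store=store, name="training-jobs",
                        description="demo workspace for distributed jobs")
run = metadata.Run(workspace=ws, name="mnist-tfjob-run")
execution = metadata.Execution(name="mnist-tfjob-exec", workspace=ws, run=run)

# Record the trained model as an output artifact of this execution.
execution.log_output(metadata.Model(
    name="mnist",
    uri="s3://example-bucket/mnist/model",      # placeholder URI
    version="v1",
))

# Record evaluation metrics so a job-level view could aggregate them later.
execution.log_output(metadata.Metrics(
    name="mnist-eval",
    uri="s3://example-bucket/mnist/eval.json",  # placeholder URI
    metrics_type=metadata.Metrics.VALIDATION,
    values={"accuracy": 0.95},
))
```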

@Jeffwan
Member Author

Jeffwan commented Apr 12, 2020

Seems there is no more feedback. We can merge this one.

Member

@terrytangyuan terrytangyuan left a comment

Thanks!

/lgtm
/approve

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: terrytangyuan


@k8s-ci-robot k8s-ci-robot merged commit c142d96 into kubeflow:master Apr 12, 2020
@Jeffwan Jeffwan deleted the roadmap branch April 12, 2020 22:03
@terrytangyuan
Member

terrytangyuan commented Apr 12, 2020

@Jeffwan Thanks! I created separate issues to track each of the items that are not being tracked yet. We can continue the discussion on those issues.

@Jeffwan
Member Author

Jeffwan commented Apr 13, 2020

> @Jeffwan Thanks! I created separate issues to track each of the items that are not being tracked yet. We can continue the discussion on those issues.

Excellent! Let's discuss more details in the separate issues.

georgkaleido pushed a commit to georgkaleido/common that referenced this pull request Jun 9, 2022