
Add distributed training operator roadmap #61

Merged
2 commits merged into kubeflow:master from the roadmap branch on Apr 12, 2020

Conversation

Jeffwan
Member

@Jeffwan Jeffwan commented Apr 6, 2020

Let's have more discussion here on the training operator 2020 roadmap.

I think most of the operators are pretty stable, and the major work is to graduate the remaining operators to v1 and rewrite the operators to use the common APIs.

Besides that, we did see some feature requests go unfulfilled last year, so let's revisit them.
In addition, let's think big about features/products that move toward a production-grade level.

@terrytangyuan @gaocegege @johnugeorge @rongou @richardsliu @jian-he @merlintang @suleisl2000 @jlewi

@kubeflow-bot

This change is Reviewable

ROADMAP.md Outdated

* Enhance maintainability of operator common module https://github.com/kubeflow/common/issues/54
* Migrate operators to use kubeflow/common apis
* Graduate MPI Operator, Mxnet Operator and XGBoost Operator to v1
Member

YES! We probably need to create concrete plans for each operator (probably in individual repos).

ROADMAP.md Outdated

## Features

To take advantage of other capabilities of job scheduler components, operators will expose more APIs for advanced scheduling. More features will be added to simplify usage like dynamic volume supports and git ops experiences. In order to make it easily used in the Kubeflow ecosystem, we can add more launcher KFP components for adoption.
Member

Could you elaborate on "launcher KFP components for adoption"?

Member Author

From our customers, we see that users have different options for the training component in KFP. We have cloud components like SageMaker, and lots of users want to use the Kubeflow-based training operators as well. In order to use TF/MPI smoothly, it would be better to write a common launcher KFP component like https://github.com/kubeflow/pipelines/tree/master/components/kubeflow/katib-launcher

Users would just need to pass the count of PS and workers, the container image, arguments, etc. for the kind of job, plus output artifacts. This is much easier compared to writing a YAML and using ResourceOp.

My point is that we need a high-level abstraction of the job to increase operator adoption in the KF ecosystem.
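
For illustration only, here is a minimal sketch of how a pipeline could use such a common launcher with the KFP v1 SDK. The component URL, its input names, and the image are placeholders for the proposed launcher, not an existing component; only `kfp.components.load_component_from_url`, `dsl.pipeline`, and the compiler call are real SDK pieces.

```python
# Hypothetical sketch of the proposed common launcher component (KFP v1 SDK).
# The component URL and its input names are made up for illustration.
import kfp
import kfp.dsl as dsl
from kfp import components

# Assumption: a reusable TFJob launcher component is published somewhere,
# similar in spirit to the existing katib-launcher component.
tfjob_launcher_op = components.load_component_from_url(
    "https://example.com/components/tfjob-launcher/component.yaml"  # placeholder
)

@dsl.pipeline(name="distributed-training", description="Launch a TFJob from KFP")
def train_pipeline(worker_count: int = 2, ps_count: int = 1):
    # Pass only high-level job parameters instead of hand-writing a TFJob YAML
    # and submitting it through dsl.ResourceOp.
    tfjob_launcher_op(
        name="mnist-tfjob",                               # placeholder job name
        worker_replicas=worker_count,
        ps_replicas=ps_count,
        image="gcr.io/example/mnist:latest",              # placeholder image
        command="python /opt/mnist/train.py --epochs 10",
    )

if __name__ == "__main__":
    kfp.compiler.Compiler().compile(train_pipeline, "train_pipeline.yaml")
```

The point of the abstraction is that the launcher component would render the TFJob manifest, submit it, and wait for completion, so the data scientist never touches the YAML or a ResourceOp.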

Member

Does the TF launcher https://github.com/kubeflow/pipelines/tree/master/components/kubeflow/launcher align with what you describe here?
Then what we need to do next is to introduce launchers for the other jobs or provide a common launcher.

Member Author

Yes. We plan to have a common launcher. Using an operator on its own makes it hard to visualize artifacts and model metrics. We have to use KFP in this case, or create a separate component to integrate these features for training jobs.

ROADMAP.md Outdated
To take advantage of other capabilities of job scheduler components, operators will expose more APIs for advanced scheduling. More features will be added to simplify usage like dynamic volume supports and git ops experiences. In order to make it easily used in the Kubeflow ecosystem, we can add more launcher KFP components for adoption.

* Support dynamic volume provisioning for distributed training jobs https://github.com/kubeflow/common/issues/19
* MLOPS - Allow user to submit jobs using Git repo without building container images.
Member

MLOPS -> MLOps

Member

Could you elaborate? What's the suggested approach?

Member Author

I am not sure if that's a blocker. Based on our customers' feedback, data scientists don't like to build containers; the friendliest approach for them is to drop a Git repo. It could be a commit or a branch.

Trying to see if others have similar feedback or feelings.

Another thing is dataset versioning in MLOps. I am considering having a separate component to version datasets and link it into the training operator. But this is just a pretty early thought.

Member

@terrytangyuan terrytangyuan Apr 10, 2020

Agreed. Though I think git -> image is a separate use case that we don't want to focus on here. There are probably existing solutions already.

Regarding data versioning, it's not an easy thing to do, though it's frequently requested by many users. Pachyderm is designed to make that easier. Perhaps we can work with the Pachyderm community on that.

Member Author

@Jeffwan Jeffwan Apr 10, 2020

Yeah, agreed. For example, the Fairing cloud builder uses Kaniko to build Docker images from a notebook; that's something that already exists. On the data versioning side, there are a few solutions like Pachyderm and DVC that aim to solve the problem; we can consider leveraging these existing solutions to make them work in the Kubeflow ecosystem.

I think the main challenge today is that MLOps has a wide definition. Users' requirements include versioning code and data, reducing the time and difficulty to push models into production, etc.

A reasonable solution will be something that suits the current architecture. We definitely need some product definition for it. I can collect some feedback and draft some features later.

ROADMAP.md Outdated
## Monitoring

* Provides a better common logger https://github.com/kubeflow/common/issues/60
* Expose generic prometheus metrics in common operators https://github.com/kubeflow/common/issues/22

ROADMAP.md Outdated

* Provides a better common logger https://github.com/kubeflow/common/issues/60
* Expose generic prometheus metrics in common operators https://github.com/kubeflow/common/issues/22
* Centralized Job Dashboard for training jobs (Add metadata graph, model artifacts later)
Member

This seems to overlap with some existing projects?

Member Author

KF had a TF dashboard before, but it was deprecated. Are you talking about that one?

Member

There was a discussion earlier regarding a job dashboard. It was deprecated because users didn't find the dashboard very useful, and it is not possible to list all operator spec options in the UI.

Member

One possibility could be a common dashboard across operators?

Member Author

Yeah, I think that would be helpful. When there are enough features in that common dashboard, users will like it.

I think a reasonable request will look like the following (a rough sketch of item 1 follows the list):

  1. Check jobs of all CRDs.
  2. Basic metadata of the job -> start/end time, job name, resources.
  3. Actions -> terminating a job.
  4. Metrics -> it would be great to integrate kubeflow/metadata info directly into the job, like model artifacts and logged params. Actually, I think we can extend the metadata SDK to richer graphs like line charts and bar charts. It would be helpful if we had job-level metrics.
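
As a rough sketch of item 1 above, a common dashboard backend could aggregate jobs across the training CRDs through the Kubernetes API. The snippet below uses the official Python client; the exact group/version/plural for each operator depends on its graduation level, so treat the list as an assumption.

```python
# Minimal sketch: list training jobs across several Kubeflow CRDs so a common
# dashboard could show them in one place (uses the official kubernetes client).
from kubernetes import client, config

# CRDs served by the training operators under the kubeflow.org API group.
# Versions are assumptions; adjust per operator (some are not yet v1).
TRAINING_CRDS = [
    ("kubeflow.org", "v1", "tfjobs"),
    ("kubeflow.org", "v1", "pytorchjobs"),
    ("kubeflow.org", "v1", "xgboostjobs"),
]

def list_training_jobs():
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    api = client.CustomObjectsApi()
    jobs = []
    for group, version, plural in TRAINING_CRDS:
        resp = api.list_cluster_custom_object(group, version, plural)
        for item in resp.get("items", []):
            meta = item.get("metadata", {})
            status = item.get("status", {})
            jobs.append({
                "kind": plural,
                "name": meta.get("name"),
                "namespace": meta.get("namespace"),
                "startTime": status.get("startTime"),
                "completionTime": status.get("completionTime"),
            })
    return jobs

if __name__ == "__main__":
    for job in list_training_jobs():
        print(job)
```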

ROADMAP.md Outdated
We will continue developing capabilities for better reliability, scaling, and maintenance of production distributed training experiences provided by operators.

* Enhance maintainability of operator common module https://github.com/kubeflow/common/issues/54
* Migrate operators to use kubeflow/common apis
Member

Do you mean tf and pytorch?

Member Author

TF and PyTorch are already on it, but PyTorch still imports TF libraries. I think MXNet is another one.

Member

Gotcha.

Member

But TF and PyTorch are not using the common implementation yet.

Contributor

Tree-based models like XGBoost or LightGBM would use this. The XGBoost operator is implemented via common.

ROADMAP.md Outdated
* Support dynamic volume provisioning for distributed training jobs https://github.com/kubeflow/common/issues/19
* MLOPS - Allow user to submit jobs using Git repo without building container images.
* Add Job priority and Queue in SchedulingPolicy for advanced scheduling in common operator https://github.com/kubeflow/common/issues/46
* Add pipeline launcher components for different training jobs. https://github.com/kubeflow/pipelines/issues/3445
Member

What can we do from kubeflow/common's side?

Member Author

I didn't explain this clearly. This won't be done in kubeflow/common, but it's in the same area. Either operator contributors or KFP contributors need to fill the gap.

I think we are trying to consider, from a high level, what we can do to make the operators easily and widely used in the KF ecosystem.

Member

SGTM.

@gaocegege
Member

LGTM.

Thanks for your contribution! 🎉 👍

ROADMAP.md Outdated

Continue to optimize reconciler performance and reduce latency to take actions on CR events.

* Performance optimization for 500 concurrent jobs and large scale completed jobs. https://github.com/kubeflow/tf-operator/issues/965 https://github.com/kubeflow/tf-operator/issues/1079
Member

@johnugeorge johnugeorge Apr 7, 2020

Is it planned as part of the kubeflow/common implementation? Kubeflow/Kubebench can also help here.

Member Author

It won't be inside kubeflow/common; I think it would be better to keep using Kubeflow/Kubebench. I'm trying to get more feedback on common requirements across the different operators. We need to leverage existing solutions.

ROADMAP.md Outdated
* MLOps - Allow user to submit jobs using Git repo without building container images.
* Add Job priority and Queue in SchedulingPolicy for advanced scheduling in common operator. Related Issue: [#46](https://github.com/kubeflow/common/issues/46).
* Add pipeline launcher components for different training jobs. [pipeline#3445](https://github.com/kubeflow/pipelines/issues/3445).

Member

I am thinking that there might be many people submitting distributed training jobs directly using the command line and the job operators.
Maybe we can think about how to export the metrics and artifacts to kubeflow/metadata in this case.
Or is currently using the metadata SDK in the training code enough?

Member Author

@ChanYiLin
Users can check metrics or artifacts in a separate UI, but it's kind of isolated. The ideal case is (a rough sketch follows the list):

  1. All job-based metadata, metrics, or artifacts can be aggregated at the job level.
  2. Support rich metadata like bar charts, confusion matrices, ROC curves, etc. The Pipelines project has some, but I notice that most training operator users may not use KFP; we can tell that from the integration between KFP and the operators.
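
For context on the "metadata SDK in the training code" option, here is a very rough sketch using the kubeflow-metadata Python SDK. The workspace, run, and artifact names plus the URIs are made up, the gRPC endpoint assumes the default in-cluster metadata service, and the exact class signatures should be double-checked against the SDK before relying on them.

```python
# Rough sketch: log a model artifact and job-level metrics from training code
# with the kubeflow-metadata SDK. Names, URIs, and the endpoint are assumptions.
from kubeflow.metadata import metadata

store = metadata.Store(grpc_host="metadata-grpc-service.kubeflow", grpc_port=8080)
ws = metadata.Workspace(store=store, name="training-jobs",
                        description="demo workspace for distributed jobs")
run = metadata.Run(workspace=ws, name="mnist-tfjob-run")
execution = metadata.Execution(name="mnist-tfjob-exec", workspace=ws, run=run)

# Record the trained model as an output artifact of this execution.
execution.log_output(metadata.Model(
    name="mnist",
    uri="s3://example-bucket/mnist/model",      # placeholder URI
    version="v1",
))

# Record evaluation metrics so a job-level view could aggregate them later.
execution.log_output(metadata.Metrics(
    name="mnist-eval",
    uri="s3://example-bucket/mnist/eval.json",  # placeholder URI
    metrics_type=metadata.Metrics.VALIDATION,
    values={"accuracy": 0.95},
))
```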

@Jeffwan
Member Author

Jeffwan commented Apr 12, 2020

Seems there is no more feedback. We can merge this one.

Member

@terrytangyuan terrytangyuan left a comment

Thanks!

/lgtm
/approve

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: terrytangyuan


@k8s-ci-robot k8s-ci-robot merged commit c142d96 into kubeflow:master Apr 12, 2020
@Jeffwan Jeffwan deleted the roadmap branch April 12, 2020 22:03
@terrytangyuan
Member

terrytangyuan commented Apr 12, 2020

@Jeffwan Thanks! I created separate issues to track each of the items that are not being tracked yet. We can continue the discussion on those issues.

@Jeffwan
Member Author

Jeffwan commented Apr 13, 2020

> @Jeffwan Thanks! I created separate issues to track each of the items that are not being tracked yet. We can continue the discussion on those issues.

Excellent! Let's discuss more details in the separate issues.

georgkaleido pushed a commit to georgkaleido/common that referenced this pull request Jun 9, 2022