Add distributed training operator roadmap #61
Conversation
ROADMAP.md
Outdated
* Enhance maintainability of operator common module https://github.com/kubeflow/common/issues/54
* Migrate operators to use kubeflow/common apis
* Graduate MPI Operator, MXNet Operator and XGBoost Operator to v1
YES! We probably need to create concrete plans for each operator (probably in the individual repos).
ROADMAP.md
Outdated
## Features

To take advantage of other capabilities of job scheduler components, operators will expose more APIs for advanced scheduling. More features will be added to simplify usage, like dynamic volume support and GitOps experiences. To make them easy to use in the Kubeflow ecosystem, we can add more launcher KFP components for adoption.
Could you elaborate on "launcher KFP components for adoption"?
Based on our customers' feedback, users have different options for the training component in KFP. We have cloud components like SageMaker, but lots of users want to use the Kubeflow-based training operators as well. To use TF/MPI smoothly, it would be better to write some common launcher KFP components like https://github.com/kubeflow/pipelines/tree/master/components/kubeflow/katib-launcher
The user just needs to pass the number of PS and worker replicas, the container image, arguments, etc. for the given kind of job and its output artifacts. This is much easier compared to writing a YAML file and using ResourceOp.
My point is that we need a high-level abstraction of jobs to increase operator adoption in the KF ecosystem. A sketch of what this could look like is below.
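For illustration, here is a minimal sketch of using such a launcher component from a KFP pipeline. The component URL and the parameter names are assumptions modeled on the existing launchers, not a settled API:

```python
# A minimal sketch, assuming a TFJob launcher component; the component URL and
# parameter names below are illustrative assumptions, not a settled API.
import kfp
from kfp import components, dsl

# Assumed location of a launcher component definition (substitute the real one).
tfjob_launcher_op = components.load_component_from_url(
    "https://raw.githubusercontent.com/kubeflow/pipelines/master/"
    "components/kubeflow/launcher/component.yaml"
)

@dsl.pipeline(name="tfjob-launcher-demo")
def train_pipeline():
    # Instead of hand-writing TFJob YAML and wiring it through ResourceOp,
    # the user passes only high-level parameters (names are hypothetical).
    tfjob_launcher_op(
        name="mnist-train",
        namespace="kubeflow",
        worker_spec={"replicas": 2, "image": "my-registry/mnist:latest"},
        ps_spec={"replicas": 1, "image": "my-registry/mnist:latest"},
    )

if __name__ == "__main__":
    kfp.compiler.Compiler().compile(train_pipeline, "tfjob_launcher_demo.yaml")
```

The value of a common launcher would be that the same handful of parameters works across TFJob, PyTorchJob, MPIJob, etc., with the component handling CR creation and waiting for completion.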
Does the TF launcher https://github.com/kubeflow/pipelines/tree/master/components/kubeflow/launcher align with what you describe here?
Then what we need to do next is to introduce launchers for the other jobs or provide a common launcher.
Yes. We plan to have a common launcher. Using the operators on their own makes it hard to visualize artifacts and model metrics; we have to use KFP in this case or create a separate component to integrate these features for training jobs.
ROADMAP.md
Outdated
To take advantage of other capabilities of job scheduler components, operators will expose more APIs for advanced scheduling. More features will be added to simplify usage, like dynamic volume support and GitOps experiences. To make them easy to use in the Kubeflow ecosystem, we can add more launcher KFP components for adoption.

* Support dynamic volume provisioning for distributed training jobs https://github.com/kubeflow/common/issues/19
* MLOPS - Allow user to submit jobs using Git repo without building container images.
MLOPS -> MLOps
Could you elaborate? What's the suggested approach?
I am not sure if that's a blocker. Based on our customers' feedback, data scientists don't like building containers; the friendliest way for them is to drop in a Git repo, which could be a commit or a branch. (A hypothetical sketch of what that could look like follows.)
Trying to see if others have similar feedback or feelings.
Another thing is dataset versioning in MLOps. I am considering having a separate component to version datasets and link it into the training operator. But this is just a pretty early thought.
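To make the request concrete, here is a purely hypothetical sketch of Git-based submission; the `gitSpec` field does not exist in any operator today and only illustrates the idea:

```python
# Hypothetical sketch only: TFJob has no Git-based submission field today.
# This illustrates what "submit from a Git repo without building an image"
# might look like if the operators grew such support.
from kubernetes import client, config

config.load_kube_config()

tfjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "mnist-from-git", "namespace": "kubeflow"},
    "spec": {
        # Hypothetical field: the operator would clone the repo (e.g. via an
        # init container) instead of requiring a prebuilt image.
        "gitSpec": {
            "repo": "https://github.com/example/mnist-training.git",
            "ref": "v0.1.0",
            "entrypoint": "python train.py",
        },
        "tfReplicaSpecs": {
            "Worker": {
                "replicas": 2,
                "template": {"spec": {"containers": [
                    {"name": "tensorflow", "image": "tensorflow/tensorflow:2.1.0"}
                ]}},
            },
        },
    },
}

# Submit the custom resource with the standard Kubernetes Python client.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="kubeflow",
    plural="tfjobs", body=tfjob,
)
```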
Agreed. Though I think git -> image is a separate use case that we don't want to focus on here. There are probably existing solutions for that already.
Regarding data versioning, it's not an easy thing to do, though it is frequently requested by many users. Pachyderm is designed to make that easier. Perhaps we can work with the Pachyderm community on that.
Yeah, agreed. For example, the Fairing cloud builder uses Kaniko to build Docker images from notebooks; that's something that already exists. On the data versioning side, there are a few solutions like Pachyderm and DVC that aim to solve this problem; we can consider leveraging these existing solutions to make them work in the Kubeflow ecosystem.
I think the main challenge today is that MLOps is a wide definition. The users' requirements include versioning code and data, reducing the time and difficulty of pushing models into production, etc.
A reasonable solution will be something suitable for the current architecture. We definitely need some product definition for it. I can collect some feedback and draft some features later.
ROADMAP.md
Outdated
## Monitoring

* Provide a better common logger https://github.com/kubeflow/common/issues/60
* Expose generic Prometheus metrics in common operators https://github.com/kubeflow/common/issues/22
cc @ywskycn
ROADMAP.md
Outdated
* Provide a better common logger https://github.com/kubeflow/common/issues/60
* Expose generic Prometheus metrics in common operators https://github.com/kubeflow/common/issues/22
* Centralized job dashboard for training jobs (add metadata graph and model artifacts later)
This seems to overlap with some existing projects?
KF had a TF dashboard before, but it was deprecated. Are you talking about that one?
There was a discussion earlier regarding a job dashboard. It was deprecated because users didn't find the dashboard very useful, and it is not possible to list all operator spec options in the UI.
One option could be a common dashboard across operators?
Yeah, I think that would be helpful. If there are enough features in that common dashboard, users will like it.
I think a reasonable request will look like this (see the sketch after the list):
- Check jobs of all CRDs
- Basic metadata of the job -> start/end time, job name, resources
- Actions -> terminating a job
- Metrics -> it would be great to integrate kubeflow/metadata info directly for that job, like model artifacts and logged params. Actually, I think we can extend the metadata SDK to support richer graphs like line charts and bar graphs. It will be helpful if we have job-level metrics.
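As a rough sketch of the integration, this is roughly what job-level metric logging could look like with the kubeflow/metadata Python SDK; the service address, names, and values are placeholders, and the exact API shape may differ between SDK versions:

```python
# A rough sketch, assuming the kubeflow/metadata Python SDK (kubeflow-metadata
# on PyPI). Service address, names, and metric values are placeholders, and
# the API shape may differ between SDK versions.
from kubeflow.metadata import metadata

store = metadata.Store(
    grpc_host="metadata-grpc-service.kubeflow",  # in-cluster service (placeholder)
    grpc_port=8080,
)
workspace = metadata.Workspace(store=store, name="training-jobs")
run = metadata.Run(workspace=workspace, name="mnist-train-run")
execution = metadata.Execution(name="mnist-train", workspace=workspace, run=run)

# Log job-level metrics so a common dashboard could aggregate and render
# them later (e.g. as the line or bar charts mentioned above).
execution.log_output(
    metadata.Metrics(
        name="mnist-eval",
        uri="s3://my-bucket/mnist/metrics.json",  # placeholder artifact URI
        metrics_type=metadata.Metrics.VALIDATION,
        values={"accuracy": 0.97, "loss": 0.08},
    )
)
```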
ROADMAP.md
Outdated
We will continue developing capabilities for better reliability, scaling, and maintenance of production distributed training experiences provided by operators.

* Enhance maintainability of operator common module https://github.com/kubeflow/common/issues/54
* Migrate operators to use kubeflow/common apis
Do you mean TF and PyTorch?
TF and PyTorch are already on it, though PyTorch still imports TF libraries. I think MXNet is another one.
Gotcha.
But TF and PyTorch are not using the common implementation yet.
Tree-based models like XGBoost or LightGBM would use this. The XGBoost operator is implemented via common.
ROADMAP.md
Outdated
* Support dynamic volume provisioning for distributed training jobs https://github.com/kubeflow/common/issues/19
* MLOPS - Allow user to submit jobs using Git repo without building container images.
* Add Job priority and Queue in SchedulingPolicy for advanced scheduling in common operator https://github.com/kubeflow/common/issues/46
* Add pipeline launcher components for different training jobs. https://github.com/kubeflow/pipelines/issues/3445
What can we do from the kubeflow/common side?
I didn't explain this clearly. This won't be done in kubeflow/common, but it is in the same area; either operator contributors or KFP contributors need to fill the gap.
I think we are trying to consider, from a high level, what we can do to make operators easily and widely used in the KF ecosystem.
SGTM.
LGTM. Thanks for your contribution! 🎉 👍
ROADMAP.md
Outdated
Continue to optimize reconciler performance and reduce the latency of taking actions on CR events.

* Performance optimization for 500 concurrent jobs and large numbers of completed jobs. https://github.com/kubeflow/tf-operator/issues/965 https://github.com/kubeflow/tf-operator/issues/1079
Is it planned as part of the kubeflow/common implementation? Kubeflow/Kubebench can also help here.
This won't be inside kubeflow/common; I think it would be better to keep using Kubeflow/Kubebench. I'm trying to get more feedback on common requirements across the different operators. We need to leverage existing solutions.
* MLOps - Allow user to submit jobs using Git repo without building container images.
* Add Job priority and Queue in SchedulingPolicy for advanced scheduling in common operator. Related issue: [#46](https://github.com/kubeflow/common/issues/46).
* Add pipeline launcher components for different training jobs. [pipeline#3445](https://github.com/kubeflow/pipelines/issues/3445).
I am thinking that there might be many people submitting distributed training jobs directly using the command line and the job operators.
Maybe we can think about how to export the metrics and artifacts to kubeflow/metadata in that case.
Or is currently using the metadata SDK in the training code enough?
@ChanYiLin
Users can check metrics or artifacts in a separate UI, but it's kind of isolated. The ideal case is:
- All job-based metadata, metrics, and artifacts can be aggregated at the job level
- Support for rich metadata like bar charts, confusion matrices, ROC curves, etc. The pipeline project has some of this, but I notice most training operator users may not use KFP, as we can tell from the integration between KFP and the operators.
Seems there is no more feedback. We can merge this one.
Thanks!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: terrytangyuan

The full list of commands accepted by this bot can be found here. The pull request process is described here.
@Jeffwan Thanks! I created separate issues to track each of the items that are not being tracked yet. We can continue the discussion on those issues.
Excellent! Let's discuss more details in the separate issues.
Let's have more discussion here on the training operator 2020 roadmap.
I think most of the operators are pretty stable, and the major work is to graduate all the other operators to v1 and rewrite the operators to use the common APIs.
Besides that, we did see some feature requests that were not fulfilled last year; let's revisit them.
In addition, let's think big about features/products that reach a production-grade level.
@terrytangyuan @gaocegege @johnugeorge @rongou @richardsliu @jian-he @merlintang @suleisl2000 @jlewi