feat(wg): Add WG Training #356

Merged 8 commits on Aug 3, 2020

@gaocegege
Member

gaocegege commented Jul 6, 2020

/cc @johnugeorge @terrytangyuan @andreyvelich @jlewi

Signed-off-by: Ce Gao <gaoce@caicloud.io>

| Person | WG | Response |
| --- | --- | --- |
| @animeshsingh | KFServing, Pipelines | |
| @cliveseldon | KFServing | |
| @yuzisun | KFServing | |
| @ellistarn | KFServing | |
| @neuromage | Pipelines/Metadata | |
| @paveldournov | Pipelines/Metadata | |
| @elikatsis | Pipelines | |
| @vpavlin | Control plane | |
| @yanniszark | Control Plane/common services | |
| @Jeffwan | Control Plane | LGTM |
| @krishnadurai | Common Services | |
| @terrytangyuan | Training | LGTM |
| @gaocegege | Training | |
| @andreyvelich | Training | |
| @johnugeorge | Training | LGTM |
| @aronchick | Metadata | |
| @StefanoFioravanzo | Metadata | |
| @jbottum | PM | |
| @elviraux | PM | |
| @kimwnasptd | Notebooks | |
| @krazyhaas | | |

gaocegege added 2 commits July 6, 2020 19:00
Signed-off-by: Ce Gao <gaoce@caicloud.io>
Signed-off-by: Ce Gao <gaoce@caicloud.io>
Signed-off-by: Ce Gao <gaoce@caicloud.io>
proposals/wgs/training/wgs.yaml (outdated review thread, resolved)
proposals/wgs/training/wgs.yaml (outdated review thread, resolved)
gaocegege added 2 commits July 7, 2020 09:42
Signed-off-by: Ce Gao <gaoce@caicloud.io>
Signed-off-by: Ce Gao <gaoce@caicloud.io>
@Jeffwan
Member

Jeffwan commented Jul 8, 2020

/lgtm

@Jeffwan
Member

Jeffwan commented Jul 8, 2020

It seems we have members in different time zones, so the meeting time could be early morning or late afternoon US time. I am in UTC-7.

@gaocegege
Member Author

Should we organize two meetings for different time zones?

@terrytangyuan
Member

terrytangyuan commented Jul 9, 2020

I am in EST. Another option is to have meetings alternating between two friendly time zones (similar to Kubeflow community meeting). We can fill out a survey to see which way would work better for us.

@gaocegege
Member Author

> I am in EST. Another option is to have meetings alternating between two friendly time zones (similar to Kubeflow community meeting). We can fill out a survey to see which way would work better for us.

SGTM.

@gaocegege
Member Author

Personally, I prefer bi-weekly meetings for the WG. I think there are only a few owners/contributors, although we have about 10 projects in WG Training.

A weekly meeting would be a huge overhead for members. WDYT?

@terrytangyuan
Member

terrytangyuan commented Jul 10, 2020

Yes, I meant that either bi-weekly or monthly is fine with me. I agree that a weekly meeting would be too much overhead.

@gaocegege gaocegege mentioned this pull request Jul 10, 2020
@andreyvelich
Member

I agree with @terrytangyuan's comment: #356 (comment).
Maybe we could have 2 meetings per month: one for China/Europe and one for the U.S.?

@jlewi
Contributor

jlewi commented Jul 13, 2020

@gaocegege and @andreyvelich The scopes of the AutoML and Training WGs need to be better defined, and once they are, I'd like to come back to the question of whether they should be separate WGs or subprojects within the groups.

Here's the scope for the Training WG:

> WG Training covers developing, deploying, and operating training jobs on Kubeflow.

Here's the scope for the AutoML WG:

> WG AutoML is responsible for all aspects of Automated Machine Learning technologies on Kubeflow.
> The WG covers researching, developing and operating various targets of ML automation for Kubeflow.

These scopes are pretty self-referential. They don't actually tell me what's included in training or AutoML.

They also seem to overlap heavily. AutoML has to launch training jobs, right? But that would be under the purview of training.

If I define the scope of training as "learning models from data", then that would encompass both training operators and AutoML.

@gaocegege
Member Author

I think WG training is focused on the training part of https://cloud.google.com/ai-platform , and WG AutoML is focused on https://cloud.google.com/automl .

Training is to learn parameters from data; AutoML is to learn models from data.

WDYT @andreyvelich

@andreyvelich
Member

andreyvelich commented Jul 14, 2020

I agree with @gaocegege.

Maybe we should modify the scopes of WG Training and WG AutoML to be more precise.

I think the main purpose of the Training WG is to run ML training jobs, but in AutoML, training is just part of the workflow.
AutoML can also: store and generate optimised models, store metrics, prepare training data, label training data, and provide various interfaces and visualisations for communication between the user and optimisation problems, etc.

Signed-off-by: Ce Gao <gaoce@caicloud.io>
@k8s-ci-robot k8s-ci-robot removed the lgtm label Jul 21, 2020
@gaocegege
Member Author

@jlewi Can we come to an agreement that we will have WG Training and WG AutoML?

@terrytangyuan Your comments are addressed. Please take a look, thanks for your time 😄

@terrytangyuan
Member

@gaocegege Thanks. Looks good to me.

/assign @jlewi

@jlewi
Contributor

jlewi commented Jul 22, 2020

@gaocegege @andreyvelich still has #358 open; are we going with one WG or two? If one WG, which PR should stay open?

A couple of issues:

  • I'd like to see this proposal circulated and get broad buy-in from other leaders in the KF community.

  • Housekeeping: I will submit a PR to move the files out of proposals; stay tuned.

@andreyvelich
Member

@jlewi We still haven't come to a final agreement.
These comments are in favor of having separate WGs: #356 (comment), #356 (comment), #356 (comment).

@animeshsingh @johnugeorge @karlschriek @vpavlin @jbottum can you give your thoughts about WG Training and WG AutoML, please?

@jlewi
Contributor

jlewi commented Jul 23, 2020

@Jeffwan

> I think it should be different WG. AutoML is a separate domain even the user experience could be close. AutoML should concentrate more on the methods like Grid/Random search, NAS. it should leverage existing operator to run those experiment

This is a strong argument for making them distinct projects but why distinct WGs?

The charters for the AutoML and Training WGs are pretty narrowly defined, in my opinion. The charters as written look more like project charters than a WG charter.

From a technical perspective, it seems like our AutoML efforts and training efforts are pretty coupled.

The scope of training defined in this doc is:

> WG Training covers developing, deploying, and operating training jobs on Kubeflow.

Launching training jobs is pretty fundamental to AutoML/hyperparameter tuning. AutoML/hyperparameter tuning is basically an orchestrator of training jobs.

This means our training APIs ought to be defined with the goal of supporting AutoML and hyperparameter tuning.

E.g., per the discussion in kubeflow/katib#1273 (comment), the fact that we don't have a KF resource model that extends the K8s resource models with features like inputs and outputs makes it much harder for AutoML/HP tuning to orchestrate training jobs.

This strongly suggests to me that these projects should roll up to a shared set of tech leads/managers who can drive the consensus/compromises needed to ensure these projects play well together.

Another strong reason for a shared WG is to amortize the cost of the supporting infrastructure.

See for example: kubeflow/testing#737. Suppose test and release infrastructure becomes the responsibility of each WG; i.e. each WG is responsible for maintaining its own K8s clusters if needed, its own docker registries, etc. Do the training and AutoML WGs have the critical mass to support this or would you be better off collaborating?

@andreyvelich
Member

andreyvelich commented Jul 24, 2020

> Launching training jobs is pretty fundamental to AutoML/hyperparameter tuning. AutoML/hyperparameter tuning is basically an orchestrator of training jobs.
> This means our training APIs ought to be defined with the goal of supporting AutoML and hyperparameter tuning.

I am not sure that AutoML is just an orchestrator of training jobs.
Basically, a training job is needed only for the model training step in AutoML.
AutoML also includes algorithm management, model management for NAS, storage management to save metrics/NN models/hyperparameters, and visualisation management.

Also, after implementing kubeflow/katib#1273 (comment), we will not depend on the Kubeflow training operators' APIs. The Kubeflow training operators just need to follow common principles, which they are already doing, because they are built on top of the common repo.

> See for example: kubeflow/testing#737. Suppose test and release infrastructure becomes the responsibility of each WG; i.e. each WG is responsible for maintaining its own K8s clusters if needed, its own docker registries, etc. Do the training and AutoML WGs have the critical mass to support this or would you be better off collaborating?

I believe we can't set up the same test infrastructure for training jobs and AutoML. Complete integration tests for AutoML can be very different from those for the training operators, because AutoML includes various components.

For the Docker registry, we currently maintain our own: https://hub.docker.com/u/kubeflowkatib for some of Katib's example images.

From my perspective, it is hard to define a scope for Training and AutoML in one WG because they have different goals.

What do you think @gaocegege @terrytangyuan @johnugeorge @Jeffwan ?

@gaocegege
Member Author

I also think so. AutoML is a separate topic.

@jlewi
Contributor

jlewi commented Jul 27, 2020

The leads for the AutoML WG look like a subset of the leads of the training WG. The training WG looks like it only contains two leads who aren't also in the AutoML WG. If there's very little overlap between these WGs why is there so much overlap in the leads? If the two WGs don't have enough people to have independent leads, is that an indication that we haven't reached critical mass to support two WGs? If AutoML and Training are both meaty and independent topics warranting their own WGs, are the leads going to be oversubscribed trying to lead two WGs?

@gaocegege
Member Author

> The leads for the AutoML WG look like a subset of the leads of the training WG. The training WG looks like it only contains two leads who aren't also in the AutoML WG. If there's very little overlap between these WGs why is there so much overlap in the leads? If the two WGs don't have enough people to have independent leads, is that an indication that we haven't reached critical mass to support two WGs? If AutoML and Training are both meaty and independent topics warranting their own WGs, are the leads going to be oversubscribed trying to lead two WGs?

I think if we just have one WG for both training and AutoML, then there is no chance to get more contributors/maintainers for AutoML projects. If we have a separate WG for AutoML, I think we can engage more people in this area to get involved in our community.

@andreyvelich
Member

Agree with @gaocegege.
Do we need to duplicate people among the Chairs and Tech Leads?
Maybe we could ask folks at the community meeting who else wants to be in the TL stack for the Training WG and the AutoML WG?

@jlewi
Contributor

jlewi commented Jul 29, 2020

> I think if we just have one WG for both training and AutoML, then there is no chance to get more contributors/maintainers for

Could you explain this? Why would the WG that owns, say, Katib impact an individual's decision to contribute to Katib? A contributor would still have the same opportunity to:

  • Become a project OWNER for an AutoML project
  • Propose a new AutoML project

> From my perspective, it is hard to define a scope for Training and AutoML in one WG because they have different goals.

The fact that all of the AutoML leads are currently also leads in the training WG suggests to me there is some underlying connection between these two. What is the connection? M

@andreyvelich
Member

> From my perspective, it is hard to define a scope for Training and AutoML in one WG because they have different goals.

> The fact that all of the AutoML leads are currently also leads in the training WG suggests to me there is some underlying connection between these two. What is the connection? M

We have the same leads because we don't currently have enough contributors, but the scopes of these projects are different.

I thought one of the main purposes of creating WGs is to grow the Kubeflow community, and we can involve more leads in the WGs.
Later on, we can split chairs and leads between these WGs. What do you think?

@johnugeorge
Member

@jlewi Yes, currently there is overlap between the leads. But can we say that one project is dependent on the other? There have been lots of contributors in the training area since the beginning who were not Katib users/developers, and vice versa.

Keeping a separate WG for AutoML will help in focusing on newer features happening in that area. It is an active and evolving research area, so I expect more activity here in the future. The major focus in the training area in recent times has been on building a common code base and APIs and supporting newer operators, while for Katib the focus is more on new algorithms and features, and the API is still getting into beta. I feel these two projects have different directions, and the discussion points are also different if we look at them carefully. So, technically, having separate WGs makes more sense than merging into one.

However, given the current state of having common leads across projects and fewer active contributors, I am also concerned about whether it is an overcommitment to have separate WGs in terms of the effort to be put in (testing, release infra support, etc., as Jeremy pointed out).

@terrytangyuan
Member

terrytangyuan commented Jul 30, 2020

Given the current roadmap of the training operators and Katib, it makes sense to me from a technical standpoint that these can potentially/eventually be separate WGs as the number of contributors grows.

However, I have concerns similar to those that @jlewi and @johnugeorge mentioned above. There may be a lot of infra/testing/releasing effort and communication overhead involved if there are two separate WGs.

An alternative strategy to consider is to start with one WG so we can all make our best commitment to help it grow; if we realize that we've attracted enough contributors who can potentially become leads, we can start this discussion again and consider gradually rolling out a new WG.

@jlewi
Contributor

jlewi commented Jul 30, 2020

Thanks @johnugeorge and @andreyvelich

I think at this point I'm willing to go along with 1 or 2 WGs for training and AutoML, based on whatever the consensus is.

Since this is one of the first WGs to be formally approved, I think it would be good to get an LGTM from some other potential WG leads to ensure the charter etc. is well scoped. I might suggest @ellistarn and @animeshsingh, as they have been leading the KFServing WG and it would be good to ensure our processes are converging.


### Out of scope

- APIs used for running inference/serving tasks (this falls under the purview of WG Serving).
Contributor

What about hyperparameter tuning?

Member Author

/cc @andreyvelich @jlewi

We are still considering whether we should have WG AutoML, so we did not add it here.

day: Wednesday
time: "03:00"
tz: PT (Pacific Time)
frequency: monthly - first Wednesday every month
Contributor

I would recommend a more frequent WG meeting while this is established.

Member

There is another meeting that's US friendly (see below). That would probably be sufficient. I would expect most of the communication to be done asynchronously on GitHub or Slack, as members are in different time zones.

Member

@gaocegege I would also like to join the training working group, thanks!

Member Author

@ChanYiLin Thanks for your contribution. Added

@ellistarn
Contributor

> Thanks @johnugeorge and @andreyvelich
>
> I think at this point I'm willing to go along with 1 or 2 WGs for training and AutoML, based on whatever the consensus is.
>
> Since this is one of the first WGs to be formally approved, I think it would be good to get an LGTM from some other potential WG leads to ensure the charter etc. is well scoped. I might suggest @ellistarn and @animeshsingh, as they have been leading the KFServing WG and it would be good to ensure our processes are converging.

Took a look. Reminds me that @animeshsingh and I need to finally publish our charter. It was in Google Doc form ~1 year ago.

Signed-off-by: Ce Gao <gaoce@caicloud.io>
@johnugeorge
Member

johnugeorge commented Jul 31, 2020

Based on the comments from everyone, we can go ahead with 2 WGs - training and AutoML. LGTM from my side

/lgtm

@terrytangyuan
Member

Yes, I am also fine with two WGs.

/lgtm

@jlewi
Contributor

jlewi commented Aug 3, 2020

/lgtm
/approve

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi

@k8s-ci-robot k8s-ci-robot merged commit f06ab5f into kubeflow:master Aug 3, 2020