feat(wg): Add WG Training #356

gaocegege · 2020-07-06T11:06:40Z

/cc @johnugeorge @terrytangyuan @andreyvelich @jlewi

Person	WG	Response
@animeshsingh	KFServing, Pipelines
@cliveseldon	KFServing
@yuzisun	KFServing
@ellistarn	KFServing
@neuromage	Pipelines/Metadata
@paveldournov	Pipelines/Metadata
@elikatsis	Pipelines
@vpavlin	Control plane
@yanniszark	Control Plane/common services
@Jeffwan	Control Plane	LGTM
@krishnadurai	Common Services
@terrytangyuan	Training	LGTM
@gaocegege	Training
@andreyvelich	Training
@johnugeorge	Training	LGTM
@aronchick	Metadata
@StefanoFioravanzo	Metadata
@jbottum	PM
@elviraux	PM
@kimwnasptd	Notebooks
@krazyhaas

Signed-off-by: Ce Gao <gaoce@caicloud.io>

kubeflow-bot · 2020-07-06T11:06:47Z

This change is

proposals/wgs/training/wgs.yaml

proposals/wgs/training/wg-charter-training.md

Jeffwan · 2020-07-08T23:34:14Z

/lgtm

Jeffwan · 2020-07-08T23:36:45Z

It seems we have members in different time zones, so the meeting time could be early morning or late afternoon in US time. I am in UTC-7.

gaocegege · 2020-07-09T02:05:26Z

Should we organize two meetings for different time zones?

terrytangyuan · 2020-07-09T12:42:20Z

I am in EST. Another option is to have meetings alternating between two friendly time zones (similar to Kubeflow community meeting). We can fill out a survey to see which way would work better for us.

gaocegege · 2020-07-10T06:06:17Z

> I am in EST. Another option is to have meetings alternating between two friendly time zones (similar to Kubeflow community meeting). We can fill out a survey to see which way would work better for us.

SGTM.

gaocegege · 2020-07-10T06:12:46Z

Personally, I prefer bi-weekly meetings for the WG. I think there are only a few owners/contributors, although we have about 10 projects in WG Training.

A weekly meeting would be a huge overhead for members. WDYT?

terrytangyuan · 2020-07-10T08:55:28Z

Yes, I meant either bi-weekly or monthly is fine with me. I agree that a weekly meeting would be too much overhead.

andreyvelich · 2020-07-10T17:15:23Z

I agree with @terrytangyuan's comment: #356 (comment).
Maybe we could have 2 meetings per month: one for China/Europe and one for the U.S.?

jlewi · 2020-07-13T13:57:00Z

@gaocegege and @andreyvelich The scope of the AutoML and Training WGs needs to be better defined, and once it is I'd like to come back to the question of whether they should be separate WGs or sub-projects within the groups.

Here's the scope for the Training WG:

> WG Training covers developing, deploying, and operating training jobs on Kubeflow.

Here's the scope for the AutoML WG:

> WG AutoML is responsible for all aspects of Automated Machine Learning technologies on Kubeflow.
> The WG covers researching, developing and operating various targets of ML automation for Kubeflow.

These scopes are pretty self-referential. They don't actually tell me what's included in training or AutoML.

They also seem to overlap heavily. AutoML has to launch training jobs, right? But that would be under the purview of training.

If I define the scope of training as "learning models from data", then that would encompass both training operators and AutoML.

gaocegege · 2020-07-14T02:33:29Z

I think WG Training is focused on the training part of https://cloud.google.com/ai-platform, and WG AutoML is focused on https://cloud.google.com/automl.

Training is to learn parameters from data; AutoML is to learn models from data.

WDYT @andreyvelich

andreyvelich · 2020-07-14T18:11:38Z

I agree with @gaocegege.

Maybe we should modify the scopes of WG Training and WG AutoML to be more precise.

I think the main purpose of the Training WG is to run ML training jobs, but in AutoML training is just one part of the workflow.
AutoML can also store and generate optimised models, store metrics, prepare training data, label training data, and provide various interfaces and visualisations for communication between the user and optimisation problems, etc.

gaocegege · 2020-07-21T07:43:28Z

@jlewi Can we come to an agreement that we will have WG Training and WG AutoML?

@terrytangyuan Your comments are addressed. Please take a look, thanks for your time 😄

terrytangyuan · 2020-07-21T18:47:38Z

@gaocegege Thanks. Looks good to me.

/assign @jlewi

jlewi · 2020-07-22T21:48:33Z

@gaocegege @andreyvelich still has #358 open; are we going with one WG or two? If one WG, which PR should stay open?

A couple of issues:

- I'd like to see this proposal circulated and get broad buy-in from other leaders in the KF community.
  - A good place to start would be to ping the folks listed in Proposal for Kubeflow WG Guidelines/Governance #348 and get them to review.
- Housekeeping: I will submit a PR to move the files out of proposals; stay tuned.

andreyvelich · 2020-07-22T21:59:01Z

@jlewi We still haven't come to a final agreement.
These comments are in favor of having separate WGs: #356 (comment), #356 (comment), #356 (comment).

@animeshsingh @johnugeorge @karlschriek @vpavlin @jbottum can you give your thoughts about WG Training and WG AutoML, please?

jlewi · 2020-07-23T00:06:29Z

@Jeffwan

> I think it should be different WG. AutoML is a separate domain even the user experience could be close. AutoML should concentrate more on the methods like Grid/Random search, NAS. it should leverage existing operator to run those experiment

This is a strong argument for making them distinct projects, but why distinct WGs?

The charters for the AutoML and Training WGs are pretty narrowly defined in my opinion. The charters as written look more like project charters than WG charters.

From a technical perspective, it seems like our AutoML efforts and training efforts are pretty coupled.

The scope of training defined in this doc is:

> WG Training covers developing, deploying, and operating training jobs on Kubeflow.

Launching training jobs is pretty fundamental to AutoML/hyperparameter tuning. AutoML/hyperparameter tuning is basically an orchestrator of training jobs.

This means our training APIs ought to be defined with the goal of supporting AutoML and hyperparameter tuning.

E.g., per the discussion in kubeflow/katib#1273 (comment), the fact that we don't have a KF resource model that extends the K8s resource models with features like inputs and outputs makes it much harder for AutoML/HP tuning to orchestrate training jobs.

This strongly suggests to me that these projects should roll up to a shared set of tech leads/managers who can drive the consensus/compromises needed to ensure these projects play well together.

Another strong reason for a shared WG is to amortize the cost of the supporting infrastructure.

See for example: kubeflow/testing#737. Suppose test and release infrastructure becomes the responsibility of each WG; i.e. each WG is responsible for maintaining its own K8s clusters if needed, its own Docker registries, etc. Do the Training and AutoML WGs have the critical mass to support this, or would you be better off collaborating?

andreyvelich · 2020-07-24T16:17:28Z

> Launching training jobs is pretty fundamental to AutoML/hyperparameter tuning. AutoML/hyperparameter tuning is basically an orchestrator of training jobs.
> This means our training APIs ought to be defined with the goal of supporting AutoML and hyperparameter tuning.

I am not sure that AutoML is just an orchestrator of training jobs.
Basically, a training job is needed only for the model training step in AutoML.
AutoML also includes algorithm management, model management for NAS, storage management to save metrics/NN models/hyperparameters, and visualisation management.

Also, after implementing kubeflow/katib#1273 (comment), we will not depend on the Kubeflow training operators' APIs. Kubeflow training operators just need to follow common principles, which they already do because they are built on top of the common repo.

> See for example: kubeflow/testing#737. Suppose test and release infrastructure becomes the responsibility of each WG; i.e. each WG is responsible for maintaining its own K8s clusters if needed, its own docker registries, etc.... Do the training and auto ML WGs have the critical mass to support this or would you be better off collaborating?

I believe we can't set up the same test infrastructure for training jobs and AutoML. Complete integration tests for AutoML can be very different from those for training operators, because AutoML includes various components.

As for the Docker registry, we currently maintain our own registry, https://hub.docker.com/u/kubeflowkatib, for some of Katib's example images.

From my perspective, it is hard to define Scope for Training and AutoML in one WG because they have various goals.

What do you think @gaocegege @terrytangyuan @johnugeorge @Jeffwan ?

gaocegege · 2020-07-27T02:15:53Z

I also think so. AutoML is a separate topic.

jlewi · 2020-07-27T23:47:55Z

The leads for the AutoML WG look like a subset of the leads of the training WG. The training WG looks like it only contains two leads who aren't also in the AutoML WG. If there's very little overlap between these WGs why is there so much overlap in the leads? If the two WGs don't have enough people to have independent leads, is that an indication that we haven't reached critical mass to support two WGs? If AutoML and Training are both meaty and independent topics warranting their own WGs, are the leads going to be oversubscribed trying to lead two WGs?

gaocegege · 2020-07-28T01:13:37Z

> The leads for the AutoML WG look like a subset of the leads of the training WG. The training WG looks like it only contains two leads who aren't also in the AutoML WG. If there's very little overlap between these WGs why is there so much overlap in the leads? If the two WGs don't have enough people to have independent leads, is that an indication that we haven't reached critical mass to support two WGs? If AutoML and Training are both meaty and independent topics warranting their own WGs, are the leads going to be oversubscribed trying to lead two WGs?

I think if we just have one WG for both training and AutoML, then there is no chance to get more contributors/maintainers for AutoML projects. If we have a separate WG for AutoML, we can engage more people in this area to get involved in our community, I think.

andreyvelich · 2020-07-28T23:44:06Z

Agree with @gaocegege.
Do we need to duplicate people across Chairs and Tech Leads?
Maybe we could ask folks at the community meeting who also wants to be in the TL stack for the Training WG and the AutoML WG?

jlewi · 2020-07-29T03:52:45Z

> I think if we just have one WG for both training and AutoML, then there is no chance to get more contributors/maintainers for

Could you explain this? Why would the WG that owns, say, Katib impact an individual's decision to contribute to Katib? A contributor would still have the same opportunity to

- Become a project OWNER for an AutoML project
- Propose a new AutoML project

> From my perspective, it is hard to define Scope for Training and AutoML in one WG because they have various goals.

The fact that all of the AutoML leads are currently also leads in the training WG suggests to me there is some underlying connection between these two. What is the connection? M

andreyvelich · 2020-07-30T15:36:20Z

> > From my perspective, it is hard to define Scope for Training and AutoML in one WG because they have various goals.
>
> The fact that all of the AutoML leads are currently also leads in the training WG suggests to me there is some underlying connection between these two. What is the connection? M

We have the same leads because we don't currently have enough contributors, but the scopes of these projects are different.

I thought one of the main purposes of creating WGs is to grow the Kubeflow community, so we can involve more leads in the WGs.
Later on, we can split chairs and leads between these WGs. What do you think?

johnugeorge · 2020-07-30T18:37:47Z

@jlewi Yes, currently there is overlap between the leads. But can we say that one project is dependent on the other? There have been lots of contributors in the training area since the beginning who were not Katib users/developers, and vice versa.

Keeping a separate WG for AutoML will help in focusing on newer features happening in that area. It is an active and evolving research area, so I expect more activity here in the future. The major focus in the training area in recent times has been building a common code base and APIs and supporting newer operators, while for Katib the focus is more on new algorithms and features, and the API is still getting into beta. I feel these two projects have different directions, and the discussion points are also different if we look at them carefully. So, technically, having separate WGs makes more sense than merging into one.

However, given the current state of having common leads across projects and fewer active contributors, I am also concerned about whether it is an overcommitment to have separate WGs in terms of the effort to be put in (testing, release infra support, etc., as Jeremy pointed out).

terrytangyuan · 2020-07-30T19:08:31Z

Given the current roadmap of the training operators and Katib, it makes sense to me from a technical standpoint that these can potentially/eventually be separate WGs as the number of contributors grows.

However, I have concerns similar to those @jlewi and @johnugeorge mentioned above. There may be a lot of infra/testing/releasing effort and communication overhead involved if there are two separate WGs.

An alternative strategy to consider is to start with one WG so we can all make our best commitment to help it grow, and if we realize that we've attracted enough contributors who can potentially become leads, we can start this discussion again and consider gradually rolling out a new WG.

jlewi · 2020-07-30T22:11:28Z

Thanks @johnugeorge and @andreyvelich

I think at this point I'm willing to go along with 1 or 2 WGs for training and AutoML, based on whatever the consensus is.

Since this is one of the first WGs to be formally approved, I think it would be good to get an LGTM from some other potential WG leads to ensure the charter etc. is well scoped. I might suggest @ellistarn and @animeshsingh, as they have been leading the KFServing WG and it would be good to ensure our processes are converging.

ellistarn · 2020-07-30T22:13:54Z

proposals/wgs/training/wg-charter-training.md

+
+### Out of scope
+
+- APIs used for running inference/serving tasks (this falls under the purview of WG Serving).


What about hyper parameter tuning?

/cc @andreyvelich @jlewi

gaocegege · 2020-07-31

We are still considering whether we should have WG AutoML, so we did not add it here.

ellistarn · 2020-07-30T22:15:33Z

proposals/wgs/training/wgs.yaml

+    day: Wednesday
+    time: "03:00"
+    tz: PT (Pacific Time)
+    frequency: monthly - first Wednesday every month


I would recommend a more frequent WG meeting while this is being established.

terrytangyuan · 2020-07-31

There is another meeting that's US friendly (see below). That would probably be sufficient. I would expect most of the communication to be done asynchronously on GitHub or Slack, as members are in different time zones.

ChanYiLin · 2020-07-31

@gaocegege I would also like to join the training working group, thanks!

gaocegege · 2020-07-31

@ChanYiLin Thanks for your contribution. Added.
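
To make the proposed slot in `wgs.yaml` concrete (first Wednesday of every month, 03:00 PT), here is a minimal Python sketch that finds that date and converts the start time for a few of the regions mentioned in this thread. It is an illustration only: the helper names `first_wednesday` and `meeting_local_times` are hypothetical, and the fixed UTC offsets are an assumption that holds during daylight-saving (summer) time, so DST transitions are not handled.

```python
from datetime import date, datetime, time, timedelta, timezone


def first_wednesday(year: int, month: int) -> date:
    """Return the date of the first Wednesday of the given month."""
    first = date(year, month, 1)
    # date.weekday() numbers Monday as 0, so Wednesday is 2.
    days_ahead = (2 - first.weekday()) % 7
    return first + timedelta(days=days_ahead)


# Assumed fixed summer-time offsets; real schedules should use a tz database.
OFFSETS = {
    "PT (UTC-7)": timezone(timedelta(hours=-7)),
    "ET (UTC-4)": timezone(timedelta(hours=-4)),
    "Central Europe (UTC+2)": timezone(timedelta(hours=2)),
    "China (UTC+8)": timezone(timedelta(hours=8)),
}


def meeting_local_times(year: int, month: int) -> dict:
    """Map each assumed zone label to the local start time of the meeting."""
    day = first_wednesday(year, month)
    # The meeting is defined as 03:00 in PT, per the wgs.yaml snippet.
    start = datetime.combine(day, time(3, 0), tzinfo=OFFSETS["PT (UTC-7)"])
    return {
        label: start.astimezone(tz).strftime("%Y-%m-%d %H:%M")
        for label, tz in OFFSETS.items()
    }


if __name__ == "__main__":
    for label, local in meeting_local_times(2020, 8).items():
        print(f"{label}: {local}")
```

Under these assumed offsets, the August 2020 meeting falls on 2020-08-05, and 03:00 PT corresponds to 06:00 ET, 12:00 in Central Europe, and 18:00 in China.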

ellistarn · 2020-07-30T22:16:29Z

> Thanks @johnugeorge and @andreyvelich
>
> I think at this point I'm willing to go along with 1 or 2 WGs for training and automl wgs based on whatever the consensus is.
>
> Since this is one of the first WGs to be formally approved. I think it would be good to get an LGTM from some other potential WG leads to ensure charter etc... is well scoped. I might suggest @ellistarn and @animeshsingh as they have been leading the KFServing WG and it would be good to ensure our processes are converging.

Took a look. This reminds me that @animeshsingh and I need to finally publish our charter. It was in Google Doc form ~1 year ago.

johnugeorge · 2020-07-31T05:39:10Z

Based on the comments from everyone, we can go ahead with 2 WGs - training and AutoML. LGTM from my side

/lgtm

terrytangyuan · 2020-07-31T14:54:48Z

Yes, I am also fine with two WGs.

/lgtm

jlewi · 2020-08-03T14:36:01Z

/lgtm
/approve

k8s-ci-robot · 2020-08-03T14:36:08Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [jlewi]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

gaocegege added 2 commits July 6, 2020 19:00

feat(wg): Add WG Training

f4dcc3f

Signed-off-by: Ce Gao <gaoce@caicloud.io>

feat: Add tech leads

45f8e44

Signed-off-by: Ce Gao <gaoce@caicloud.io>

k8s-ci-robot requested review from andreyvelich, jlewi, johnugeorge and terrytangyuan July 6, 2020 11:06

k8s-ci-robot added the do-not-merge/work-in-progress label Jul 6, 2020

googlebot added the cla: yes label Jul 6, 2020

k8s-ci-robot added the size/L label Jul 6, 2020

feat: Add common

df9404b

Signed-off-by: Ce Gao <gaoce@caicloud.io>

terrytangyuan reviewed Jul 6, 2020

proposals/wgs/training/wgs.yaml (outdated)

proposals/wgs/training/wgs.yaml (outdated)

andreyvelich reviewed Jul 6, 2020

proposals/wgs/training/wg-charter-training.md

gaocegege added 2 commits July 7, 2020 09:42

fix: Fix typo

1f22ddb

Signed-off-by: Ce Gao <gaoce@caicloud.io>

fix: Add automl coordination

090794d

Signed-off-by: Ce Gao <gaoce@caicloud.io>

Jeffwan approved these changes Jul 8, 2020

k8s-ci-robot assigned Jeffwan Jul 8, 2020

k8s-ci-robot added the lgtm label Jul 8, 2020

gaocegege mentioned this pull request Jul 10, 2020

Add WG AutoML #358

Merged

fix: Add friendly to avoid conflicts

1096db2

Signed-off-by: Ce Gao <gaoce@caicloud.io>

k8s-ci-robot removed the lgtm label Jul 21, 2020

k8s-ci-robot assigned jlewi Jul 21, 2020

ellistarn reviewed Jul 30, 2020

k8s-ci-robot requested a review from andreyvelich July 31, 2020 04:17

feat: Add Jack Lin

46fde85

Signed-off-by: Ce Gao <gaoce@caicloud.io>

k8s-ci-robot assigned johnugeorge Jul 31, 2020

k8s-ci-robot added the lgtm label Jul 31, 2020

k8s-ci-robot added the approved label Aug 3, 2020

k8s-ci-robot merged commit f06ab5f into kubeflow:master Aug 3, 2020

