
[manager & worker] Migrate dlk into worker interface #66

Closed
gaocegege opened this issue Apr 20, 2018 · 9 comments
@gaocegege
Member

#46 (comment)

How about migrating dlk to other worker interfaces and refining the roles of the worker interfaces as below?

  • TensorFlow, PyTorch, etc. operator worker interfaces: support distributed training tasks.
  • Kubernetes worker interface: for frameworks not supported by a kubeflow operator; manages only single-machine tasks.
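A minimal sketch of what this split could look like, assuming a Python-style interface (all names here are illustrative, not katib's actual code):

    # Hypothetical sketch of the proposed split; every name here is
    # illustrative, not katib's actual code.
    from abc import ABC, abstractmethod

    class WorkerInterface(ABC):
        @abstractmethod
        def run_trial(self, study_id: str, trial_id: str) -> str:
            """Launch a trial and return a worker ID."""

        @abstractmethod
        def get_status(self, worker_id: str) -> str:
            """Return the worker state, e.g. RUNNING or COMPLETED."""

    class TFOperatorWorker(WorkerInterface):
        """Would delegate distributed training to the kubeflow tf-operator."""

    class KubernetesWorker(WorkerInterface):
        """Would run a single-machine trial as a plain Kubernetes job."""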
@gaocegege gaocegege changed the title [dlk] Migrate it into worker interface [manager & worker] Migrate dlk into worker interface Apr 20, 2018
@ddysher
Member

ddysher commented Apr 21, 2018

This might not be the right place to ask, but I think the question I have is loosely related to this issue.

In Google Vizier, the API presented looks like this:

while (study not done)
    trial = client.GetSuggestion()
    metrics = RunTrial(trial)
    client.CompleteTrial(trial, metrics)

That is, client uses SDK to:

  • query vizier for trial
  • use the trial to run training
  • report metrics back to vizier

In katib, the workflow looks like this (correct me if I'm wrong):

  • querying for a trial is done through the katib manager
  • running a trial is done inside katib (worker interface); the suggested parameters are passed as the container entrypoint (illustrated below)
  • metrics are not reported back by the client; instead, GetSuggestion is called with the completed trials, which carry the objective value
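To illustrate the entrypoint mechanism in the second point, a hypothetical example of turning a suggestion into container args (the dict shape is assumed, not katib's actual data model):

    # Hypothetical illustration: suggested parameters become
    # command-line args for the training container.
    suggestion = {"--lr": "0.01", "--batch-size": "64"}  # assumed shape
    entrypoint = ["python", "train.py"]
    container_args = entrypoint + [f"{k}={v}" for k, v in suggestion.items()]
    # -> ['python', 'train.py', '--lr=0.01', '--batch-size=64']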

The study config is used for all of these configuration knobs, e.g. the training command to run, the job configuration, the tunable parameters, etc. It occurs to me that the study config is doing more than it should, and this is not flexible in the long term. On the other hand, it seems we don't have any discussion around a client SDK?

I haven't thought this through thoroughly, but I want to bring up some discussion around the topic.

/cc @gaocegege @YujiOshima @mitake

@gaocegege
Member Author

Yeah, I ran into the same problem when I tried to support tf-operator in katib. I find it hard to configure the tf-job in the current design, since we maintain only one configuration file, studyconfig.

@YujiOshima
Contributor

@ddysher Yes. In Katib, the while loop lives in the katib manager (trialIteration).
I agree it is not flexible.
IMHO, we don't need to prepare a new SDK; we should enrich the gRPC API instead.
Currently, once you call CreateStudy, everything is done inside the katib manager.

How about making the API more flexible, as below?

  • CreateStudy: you can choose to have everything done in katib (same as now) or to only store the study information in the DB.
  • GetSuggestion: you can get suggestions from any suggestion service.
  • CompleteTrial: report the metrics.
  • RunTrial: you can run trials with any worker interface.

In this way, you can use katib in a more flexible way.
E.g. you use CreateStudy only to save the study info to the DB.
Then you use GetSuggestion and CompleteTrial yourself:
you get trials, run them in your own environment, and report the results back to katib.
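A rough sketch of that flow from the client side (the client object, signatures, and helper are assumptions, not an existing katib SDK):

    # Rough sketch; the client object, the signatures, and
    # run_my_training are assumptions, not an existing katib SDK.
    client = KatibClient()                      # hypothetical gRPC wrapper
    study_id = client.CreateStudy(study_config, manage_in_katib=False)

    max_trials = 50                             # stopping rule is caller-owned
    for _ in range(max_trials):
        trial = client.GetSuggestion(study_id)
        metrics = run_my_training(trial)        # runs in your own environment
        client.CompleteTrial(trial, metrics)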
Make sense?

@gaocegege Does it work for your problem?

@ddysher
Member

ddysher commented Apr 22, 2018

> How about making the API more flexible, as below?

yeah, this looks promising and closely mirrors the vizier api. we can use a 'bottom-up' approach to design the API - start with the lowest-level api, where users are required to call individual functions themselves; then, once we have a clearer picture of how people are using the api, we can provide an ambassador component, like what katib does today, to offer a higher-level api for users. WDYT?

@YujiOshima
Contributor

@ddysher Great. Then I'm going to try to break the CreateStudy API down into lower-level APIs.

@jlewi
Contributor

jlewi commented May 17, 2018

@ddysher

> That is, client uses SDK to:
> query vizier for trial
> use the trial to run training
> report metrics back to vizier

Is the proposal to have trainer code call Katib to get parameters? e.g. launch a TFJob that would call GetSuggestion?

My expectation is that the loop

while (study not done)
    trial = client.GetSuggestion()
    metrics = RunTrial(trial)
    client.CompleteTrial(trial, metrics)

is not part of the TFJob/PyTorch job itself.

Any thoughts about how metrics would be reported? A couple of things come to mind:

  • One of the replicas in the TFJob could run evaluation and call back to vizier to report metrics using the API.
  • RunTrial could actually launch two jobs in sequence (sketched below), e.g.:
    • a TFJob to train the model
    • a Job to collect and report metrics (e.g. from the events file)
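A rough sketch of that two-job sequence (all helpers here are placeholders, not real katib or kubeflow APIs):

    # Placeholder helpers, not real katib or kubeflow APIs.
    def run_trial(trial):
        tfjob = launch_tfjob(trial.parameters)   # 1) distributed training
        wait_for_completion(tfjob)
        job = launch_metrics_job(tfjob)          # 2) read metrics, e.g. from
        metrics = wait_for_metrics(job)          #    the events file
        report_metrics(trial, metrics)           # call back to vizier API
        return metrics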

Thoughts?

@jlewi
Contributor

jlewi commented Jul 7, 2018

What is the current status of this? Have things changed in the current release?

I think I had a similar question to @ddysher in kubeflow/examples#162

I've created #138 requesting a design doc to figure out how the different pieces fit together.

@YujiOshima
Contributor

We have already fixed this issue.
/close

@ddysher
Member

ddysher commented Jul 9, 2018

@jlewi sorry for the late reply.

> Is the proposal to have trainer code call Katib to get parameters? e.g. launch a TFJob that would call GetSuggestion?

The usage above is copied from the Vizier paper, but I agree that trainer code shouldn't be aware of such low-level APIs. Ideally, launching pods and reporting metrics should be done somewhere else (I haven't had time to think this through yet). As you've already mentioned, I do think a design doc should come first, before we dive into the details.
