
[manager & worker] Migrate dlk into worker interface #66

Closed
gaocegege opened this issue Apr 20, 2018 · 9 comments
@gaocegege
Member

#46 (comment)

How about migrating dlk to other worker interfaces and refining the roles of the worker interfaces as below?

  • TensorFlow, PyTorch, etc. operator worker interfaces: support distributed training tasks.
  • Kubernetes worker interface: for frameworks not supported by a kubeflow operator; manages only single-machine tasks.
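A minimal sketch of what this split could look like, assuming a Python-style interface (all names here are illustrative, not katib's actual code):

    # Hypothetical sketch of the proposed split; every name here is
    # illustrative, not katib's actual code.
    from abc import ABC, abstractmethod

    class WorkerInterface(ABC):
        @abstractmethod
        def run_trial(self, study_id: str, trial_id: str) -> str:
            """Launch a trial and return a worker ID."""

        @abstractmethod
        def get_status(self, worker_id: str) -> str:
            """Return the worker state, e.g. RUNNING or COMPLETED."""

    class TFOperatorWorker(WorkerInterface):
        """Would delegate distributed training to the kubeflow tf-operator."""

    class KubernetesWorker(WorkerInterface):
        """Would run a single-machine trial as a plain Kubernetes job."""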
@gaocegege gaocegege changed the title [dlk] Migrate it into worker interface [manager & worker] Migrate dlk into worker interface Apr 20, 2018
@ddysher
Member

ddysher commented Apr 21, 2018

This might not be the right place to ask, but I think the question I have is loosely related to this issue.

In Google Vizier, the API presented looks like this:

while (study not done)
    trial = client.GetSuggestion()
    metrics = RunTrial(trial)
    client.CompleteTrial(trial, metrics)

That is, client uses SDK to:

  • query vizier for trial
  • use the trial to run training
  • report metrics back to vizier

In katib, the workflow looks like this (correct me if I'm wrong):

  • querying for a trial is done through the katib manager
  • running a trial is done inside katib (worker interface); the suggested parameters are passed as the container entrypoint (illustrated below)
  • metrics are not reported back by the client; instead, GetSuggestion is called with the completed trials, which carry the objective value
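To illustrate the entrypoint mechanism in the second point, a hypothetical example of turning a suggestion into container args (the dict shape is assumed, not katib's actual data model):

    # Hypothetical illustration: suggested parameters become
    # command-line args for the training container.
    suggestion = {"--lr": "0.01", "--batch-size": "64"}  # assumed shape
    entrypoint = ["python", "train.py"]
    container_args = entrypoint + [f"{k}={v}" for k, v in suggestion.items()]
    # -> ['python', 'train.py', '--lr=0.01', '--batch-size=64']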

The study config is used for all of these configuration knobs, e.g. the training command to run, the job configuration, the tunable parameters, etc. It occurs to me that the study config is doing more than it should, and this is not flexible in the long term. On the other hand, it seems we don't have any discussion around a client SDK?

I haven't thought this through thoroughly, but I want to bring up some discussion around the topic.

/cc @gaocegege @YujiOshima @mitake

@gaocegege
Member Author

Yeah, I ran into the same problem when I tried to support tf-operator in katib. I find it hard to configure the tf-job in the current design, since we maintain only one configuration file, studyconfig.

@YujiOshima
Contributor

@ddysher Yes. In Katib, the while loop lives in the katib manager (trialIteration).
I agree it is not flexible.
IMHO, we don't need to prepare a new SDK; we should enrich the gRPC API instead.
Currently, once you call CreateStudy, everything is done inside the katib manager.

How about making the API more flexible, as below?

  • CreateStudy: you can choose to have everything done in katib (same as now) or to only store the study information in the DB.
  • GetSuggestion: you can get suggestions from any suggestion service.
  • CompleteTrial: report the metrics.
  • RunTrial: you can run trials with any worker interface.

In this way, you can use katib in a more flexible way.
E.g. you use CreateStudy only to save the study info to the DB.
Then you use GetSuggestion and CompleteTrial yourself:
you get trials, run them in your own environment, and report the results back to katib.
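A rough sketch of that flow from the client side (the client object, signatures, and helper are assumptions, not an existing katib SDK):

    # Rough sketch; the client object, the signatures, and
    # run_my_training are assumptions, not an existing katib SDK.
    client = KatibClient()                      # hypothetical gRPC wrapper
    study_id = client.CreateStudy(study_config, manage_in_katib=False)

    max_trials = 50                             # stopping rule is caller-owned
    for _ in range(max_trials):
        trial = client.GetSuggestion(study_id)
        metrics = run_my_training(trial)        # runs in your own environment
        client.CompleteTrial(trial, metrics)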
Make sense?

@gaocegege Does it work for your problem?

@ddysher
Member

ddysher commented Apr 22, 2018

> How about making the API more flexible, as below?

yeah, this looks promising and closely mirrors the vizier api. we can use a 'bottom-up' approach to design the API - start with the lowest-level api, where users are required to call individual functions themselves; then, once we have a clearer picture of how people are using the api, we can provide an ambassador component, like what katib does today, to offer a higher-level api for users. WDYT?

@YujiOshima
Contributor

@ddysher Great. Then I'm going to try to break the CreateStudy API down into lower-level APIs.

@jlewi
Contributor

jlewi commented May 17, 2018

@ddysher

> That is, client uses SDK to:
> query vizier for trial
> use the trial to run training
> report metrics back to vizier

Is the proposal to have trainer code call Katib to get parameters? e.g. launch a TFJob that would call GetSuggestion?

My expectation is that the loop

while (study not done)
    trial = client.GetSuggestion()
    metrics = RunTrial(trial)
    client.CompleteTrial(trial, metrics)

is not part of the TFJob/PyTorch job itself.

Any thoughts about how metrics would be reported? A couple of things come to mind:

  • One of the replicas in the TFJob could run evaluation and call back to vizier to report metrics using the API.
  • RunTrial could actually launch two jobs in sequence (sketched below), e.g.:
    • a TFJob to train the model
    • a Job to collect and report metrics (e.g. from the events file)
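A rough sketch of that two-job sequence (all helpers here are placeholders, not real katib or kubeflow APIs):

    # Placeholder helpers, not real katib or kubeflow APIs.
    def run_trial(trial):
        tfjob = launch_tfjob(trial.parameters)   # 1) distributed training
        wait_for_completion(tfjob)
        job = launch_metrics_job(tfjob)          # 2) read metrics, e.g. from
        metrics = wait_for_metrics(job)          #    the events file
        report_metrics(trial, metrics)           # call back to vizier API
        return metrics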

Thoughts?

@jlewi
Contributor

jlewi commented Jul 7, 2018

What is the current status of this? Have things changed in the current release?

I think I had a similar question to @ddysher in kubeflow/examples#162

I've created #138 requesting a design doc to figure out how the different pieces fit together.

@YujiOshima
Contributor

We have already fixed this issue.
/close

@ddysher
Member

ddysher commented Jul 9, 2018

@jlewi sorry for the late reply.

> Is the proposal to have trainer code call Katib to get parameters? e.g. launch a TFJob that would call GetSuggestion?

The usage above is copied from the Vizier paper, but I agree that trainer code shouldn't be aware of such low-level APIs. Ideally, launching pods and reporting metrics should be done somewhere else (I haven't had time to think this through yet). As you've already mentioned, I do think a design doc should come first, before we dive into the details.
