Make low-level API for using katib flexibly #72

Closed
YujiOshima opened this issue Apr 24, 2018 · 20 comments

@YujiOshima
Contributor

YujiOshima commented Apr 24, 2018

Discussed in #66.
These are the APIs I'm going to refactor and add.

| API | input | process | output |
| --- | --- | --- | --- |
| CreateStudy | StudyConfig | Save Study conf to DB; CreateStudyID | StudyID, error |
| GetSuggestions | StudyID, SuggestionAlgorithmName, RequestNum | Create Trials from Suggestion | []TrialID, error |
| RunTrials | StudyID, []TrialID, Worker | Request to run Trial to worker; Set Trial status running | error |
| StopTrials | StudyID, []TrialID, IsComplete | Stop Trial worker; Set Trial Status Complete | []TrialID, error |
| ShouldStopTrial | StudyID, EarlyStopAlgorithm | Get ShouldStop Trials | []TrialID, error |
| SetSuggestionParameter | StudyID, SuggestionAlgorithmName, AlgorithmParam | Set Parameters | error |
| SetEarlyStoppingParameter | StudyID, EarlyStoppingAlgorithmName, EarlyStoppingParam | Set Parameters | error |
| GetMetrics | []TrialID | Get Metrics of Trials | []Metrics, error |
| SaveStudy | StudyID | Save StudyInfo to ModelDB | error |
| SaveModels | StudyID, []TrialID, []Metrics | Save Trial and Metrics Info to ModelDB | error |
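
As a rough sketch, the API list above could be written as a Go interface, with the inputs as arguments and the outputs as results. The type names (`StudyID`, `TrialID`, `StudyConfig`, `Worker`, `Param`, `Metrics`), the `Manager` interface and the `fakeManager` are all assumptions made for illustration; they are not the actual katib definitions.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the types named in the API list.
type (
	StudyID     string
	TrialID     string
	StudyConfig struct{ Name string }
	Worker      string            // runtime that evaluates a trial
	Param       map[string]string // algorithm parameters
	Metrics     struct {
		Trial  TrialID
		Values map[string]float64
	}
)

// Manager mirrors the API list entry by entry (assumed shape, not the real API).
type Manager interface {
	CreateStudy(conf StudyConfig) (StudyID, error)
	GetSuggestions(study StudyID, algorithm string, requestNum int) ([]TrialID, error)
	RunTrials(study StudyID, trials []TrialID, worker Worker) error
	StopTrials(study StudyID, trials []TrialID, isComplete bool) ([]TrialID, error)
	ShouldStopTrial(study StudyID, earlyStopAlgorithm string) ([]TrialID, error)
	SetSuggestionParameter(study StudyID, algorithm string, p Param) error
	SetEarlyStoppingParameter(study StudyID, algorithm string, p Param) error
	GetMetrics(trials []TrialID) ([]Metrics, error)
	SaveStudy(study StudyID) error
	SaveModels(study StudyID, trials []TrialID, metrics []Metrics) error
}

// fakeManager embeds Manager so that it satisfies the interface while
// overriding only the two methods used below; any other call would panic.
type fakeManager struct {
	Manager
	studies map[StudyID]StudyConfig
	next    int
}

// CreateStudy stores the config in memory and returns a generated ID.
func (f *fakeManager) CreateStudy(conf StudyConfig) (StudyID, error) {
	if conf.Name == "" {
		return "", errors.New("study name is required")
	}
	f.next++
	id := StudyID(fmt.Sprintf("study-%d", f.next))
	f.studies[id] = conf
	return id, nil
}

// GetSuggestions returns requestNum placeholder trial IDs for a known study.
func (f *fakeManager) GetSuggestions(study StudyID, algorithm string, requestNum int) ([]TrialID, error) {
	if _, ok := f.studies[study]; !ok {
		return nil, fmt.Errorf("unknown study %q", study)
	}
	trials := make([]TrialID, 0, requestNum)
	for i := 0; i < requestNum; i++ {
		trials = append(trials, TrialID(fmt.Sprintf("%s-trial-%d", study, i)))
	}
	return trials, nil
}

func main() {
	var m Manager = &fakeManager{studies: map[StudyID]StudyConfig{}}
	id, err := m.CreateStudy(StudyConfig{Name: "demo"})
	if err != nil {
		panic(err)
	}
	trials, err := m.GetSuggestions(id, "random", 3)
	if err != nil {
		panic(err)
	}
	fmt.Println(id, trials)
}
```

Embedding the interface in the fake keeps the sketch short; a real implementation would provide every method.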

Typical usage looks like this:

	studyId, _ := grpc.CreateStudy(studyConfig)
	grpc.SetSuggestionParameter(studyId, "random", suggestParam)
	grpc.SetEarlyStoppingParameter(studyId, "medianstopping", earlystopParam)
	grpc.SaveStudy(studyId)
	// Keep requesting suggestions until the study is completed.
	for !IsStudyCompleted() {
		trials, _ := grpc.GetSuggestions(studyId, "random", 10)
		grpc.RunTrials(studyId, trials)
		for {
			metrics, workerState, _ := grpc.GetMetrics(studyId, trials)
			if AllWorkerCompleted(workerState) {
				// All workers finished: mark the trials complete and save the results.
				grpc.StopTrials(studyId, trials, true)
				grpc.SaveModels(studyId, trials, metrics)
				break
			}
			// Stop the trials flagged by early stopping and drop them from the list.
			shouldStops := grpc.ShouldStopTrial(studyId, trials)
			grpc.StopTrials(studyId, shouldStops, false)
			trials = deleteShouldStopsFromTrialList(trials, shouldStops)
		}
	}
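
The helper that drops the should-stop trials from the trial list is not defined in the snippet above. A minimal sketch of what it might look like follows, assuming trial IDs are plain strings; the name `removeTrials` is hypothetical. A Go function cannot shrink the caller's slice in place, so the sketch returns the filtered slice for the caller to reassign.

```go
package main

import "fmt"

// removeTrials returns the trials that are not listed in stopped,
// preserving the original order. Trial IDs as strings is an assumption.
func removeTrials(trials, stopped []string) []string {
	skip := make(map[string]struct{}, len(stopped))
	for _, id := range stopped {
		skip[id] = struct{}{}
	}
	remaining := make([]string, 0, len(trials))
	for _, id := range trials {
		if _, found := skip[id]; !found {
			remaining = append(remaining, id)
		}
	}
	return remaining
}

func main() {
	trials := []string{"t1", "t2", "t3", "t4"}
	trials = removeTrials(trials, []string{"t2", "t4"})
	fmt.Println(trials) // prints [t1 t3]
}
```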

WDYT? @ddysher @gaocegege @libbyandhelen

@YujiOshima
Contributor Author

/area manager

@ddysher
Member

ddysher commented Apr 25, 2018

@YujiOshima thanks for putting this together! I've been on business travel for the last two days; I will take a look ASAP :)

@gaocegege
Member

Personally, LGTM

Thanks for your awesome work!

@ddysher
Member

ddysher commented Apr 26, 2018

Thanks @YujiOshima, I've listed some of my concerns:

For GetSuggestions API

  • Why do I have to pass SuggestionAlgorithmName and RequestNum to GetSuggestions if a study already has StudyConfig? There's also a SetSuggestionParameter API, which seems to do a similar task.
  • It seems GetSuggestions always runs synchronously? Is it necessary to provide support for asynchronous trial generation?

For RunTrials API

  • What's the Worker parameter in RunTrials?
  • What about runtime configuration for running trials, e.g. how do I pass the script to run, the resource limits for a single trial, etc.? This is similar to the problem we discussed in [manager & worker] Migrate dlk into worker interface #66

For SaveStudy API

  • This is a little confusing since from a user's perspective, Study is already saved when study is created; the SaveStudy is about saving StudyInfo to ModelDB, rather than saving it to core katib. It occurs to me we might need a separate API group for this?

@YujiOshima YujiOshima mentioned this issue Apr 26, 2018
@YujiOshima
Contributor Author

@ddysher Thank you for the comments!
I opened a PR for this: #74

For GetSuggestions API

I made StudyConfig simpler. You can specify the number of trials in each suggestion request.
I think it can run asynchronously, but I have not tested that.

For RunTrials API

Worker means the runtime, e.g. `kubernetes` or `TFoperator`. I renamed it in my PR. The runtime config is `api.WorkerConfig`.

For SaveStudy API

I agree. I changed CreateStudy to include SaveStudy.

I added a simple demo using minikube: https://github.com/YujiOshima/hp-tuning/blob/7a7086d3336f284d1ea67f2b06051d2c12d3922c/docs/MinikubeDemo/MinikubeDemo.md

Please take a look!

@ddysher
Member

ddysher commented Apr 27, 2018

@YujiOshima thanks! I'll take a look at the PR later.

@libbyandhelen
Contributor

@YujiOshima
So based on your PR and some modifications for supporting CMA-ES and BO, I drew a diagram to illustrate the main workflow.
[Image: untitled diagram]

The modifications are as follows:

  • A new grpc function GetTrial in api.proto to get a trial by trial_id.
    Each intermediate result should correspond to one trial, so I store the trial_id in each intermediate result in SuggestionParameter, and I found it easier to have a function that gets a trial by trial_id instead of study_id.
    For example, a suggestion parameter for the intermediate result would look like this:
{
  "name": "population",
  "value": "{\"trial_id\": \"A8UwJqEmK9SpzyMO\", \"x\": \"[0.22899100143490736, 0.23124807755799998]\", \"y\": \"\", \"penalty\": 0}"
}
  • It is the service that sends the create_trial request.
    In your PR, after the manager sends the get_suggestions request to the service, it receives a list of trials. It then loops through the trials and saves each of them to the database. Again, there is a one-to-one relationship between an intermediate result and a trial, so it is more natural for me to create the trials and save the intermediate results (set_suggestion_parameters) at the same time, since the trial_id is needed when saving the intermediate result. Therefore, both of these are done on the service side.

  • A new grpc function UpdateTrial in api.proto to update the status and objective value after evaluation.

  • A new grpc function GetSuggestionParameterList in api.proto.
    In the original PR, one can only get suggestion parameters by param_id, but it may be more convenient to get a suggestion parameter pack containing all the useful information by study_id.
    So the request and reply messages are:

message GetSuggestionParameterListRequest {
    string study_id = 1;
}

message GetSuggestionParameterListReply {
    message SuggestionParameterSet {
        string param_id = 1;
        string param_name = 2;
        repeated SuggestionParameter suggestion_parameters = 3;
    }
    repeated SuggestionParameterSet suggestion_parameter_set = 1;
}
{
  "name": "path_c",
  "value": "[[-0.39959364763308924], [0.010550492832451075]]"
}
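
The `population` parameter shown earlier stores a JSON object inside a JSON string, and its `x` field is itself a string holding a JSON array, so reading it back takes three decoding steps. The following is only a sketch of that decoding: the struct names and the `decodePopulation` helper are assumptions, with field names taken from the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SuggestionParameter is the outer name/value pair (assumed shape).
type SuggestionParameter struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

// intermediateResult is the object encoded inside Value (assumed shape).
type intermediateResult struct {
	TrialID string  `json:"trial_id"`
	X       string  `json:"x"` // JSON array stored as a string
	Y       string  `json:"y"`
	Penalty float64 `json:"penalty"`
}

// decodePopulation unwraps the three layers and returns the trial ID and x.
func decodePopulation(raw []byte) (string, []float64, error) {
	var p SuggestionParameter
	if err := json.Unmarshal(raw, &p); err != nil {
		return "", nil, fmt.Errorf("outer parameter: %w", err)
	}
	var r intermediateResult
	if err := json.Unmarshal([]byte(p.Value), &r); err != nil {
		return "", nil, fmt.Errorf("value of %q: %w", p.Name, err)
	}
	var x []float64
	if err := json.Unmarshal([]byte(r.X), &x); err != nil {
		return "", nil, fmt.Errorf("x of trial %q: %w", r.TrialID, err)
	}
	return r.TrialID, x, nil
}

func main() {
	raw := []byte(`{"name": "population", "value": "{\"trial_id\": \"A8UwJqEmK9SpzyMO\", \"x\": \"[0.22899100143490736, 0.23124807755799998]\", \"y\": \"\", \"penalty\": 0}"}`)
	trialID, x, err := decodePopulation(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(trialID, x)
}
```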

other notes:

  • Here is the whole commit: 3496694

  • I rewrote some relevant functions (simplified versions) in Python to test and illustrate the idea.

  • I used a function call in place of the steps in the yellow block of the diagram, because these are independent of the algorithm itself.

  • For the next step, how can we use the Go interface from Python?

@YujiOshima
Contributor Author

@libbyandhelen
Thank you! Cool!

A new grpc function GetTrial in api.proto to get trial by trial_id.

In the new API, a Trial is only a parameter set, and a Worker is an instance of the evaluation process of a trial.
So an intermediate result corresponds to one worker, and multiple workers can correspond to one trial.
How about getting the worker_id list from a trial_id? You can then get an intermediate result by calling the `GetMetrics` rpc with `worker_id`.

message GetWorkersFromTrialRequest {
    string trial_id = 1;
}

it is the service who sends the create_trial request

SGTM.
I agree it is more natural that suggestion services call CreateTrial.

a new grpc function UpdateTrial in api.proto to update status and objective value after evaluation.

Same as the first comment: the objective value corresponds to a Worker.
You can update and get the value with the GetMetrics rpc.

a new grpc function GetSuggestionParameterList in api.proto

SGTM.
I'm going to add the rpc.

A minor change in string join and split

Parameters and Tags are encoded [here](https://github.com/YujiOshima/hp-tuning/blob/cc2ddbea3a4ca672ba45da7af93c21d76ec3859b/pkg/db/interface.go#L354).
I don't think it is a problem.
Minimal encoding and decoding code: https://play.golang.org/p/wz-ML98fJW8

@libbyandhelen
Contributor

@YujiOshima
Thank you!
You said that multiple workers can correspond to one trial. What is the relationship between these workers? Are their metrics the same, or are they for different objectives? If they are for different objectives, since the current algorithms do not support multi-objective optimization, can I safely assume that one trial corresponds to only one worker?

@YujiOshima
Contributor Author

@libbyandhelen
My assumption is that when users want to train the same parameters with several initial values, or need the variance of the result, multiple workers are created from one Trial.
So the objective and metrics are the same among the workers (their values may differ).

@libbyandhelen
Contributor

@YujiOshima
OK, cool. Then maybe I can use the mean of all the metric values as the objective value.
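
A minimal sketch of that aggregation, assuming each worker of a trial reports one final value for the objective metric. The function name `meanObjective` is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// meanObjective averages the objective values reported by the workers of one
// trial. It returns an error when no worker has reported a value yet.
func meanObjective(workerValues []float64) (float64, error) {
	if len(workerValues) == 0 {
		return 0, errors.New("no worker values to average")
	}
	sum := 0.0
	for _, v := range workerValues {
		sum += v
	}
	return sum / float64(len(workerValues)), nil
}

func main() {
	// Three workers evaluated the same parameter set with different seeds.
	objective, err := meanObjective([]float64{0.5, 0.75, 1.0})
	if err != nil {
		panic(err)
	}
	fmt.Println(objective) // prints 0.75
}
```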

@libbyandhelen
Contributor

libbyandhelen commented May 11, 2018

@YujiOshima
I still have a question about GetMetrics.
The structure of the getMetricsReply seems to be like this:

[
    {
        worker_id,
        [{name, [value1, value2, ...]}, {name, [value1, value2, ...]}]
    },
    ...
]

So the question is: what are the "name" and the list of values for? Doesn't a worker have only one value?

@YujiOshima
Contributor Author

Metrics are not only for the objective value.
For example, the objective value may be accuracy, but you may also want to collect loss, recall, etc.
The names of the metrics are defined in the study config.
Katib collects the full log of each metric's values.
So when you want the latest objective value, set the name of the objective value in GetMetricsRequest.metrics_names and read getMetricsReply.metrics_log_sets.metrics_logs.values[-1]
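
Go slices have no negative indexing, so taking the last logged value means indexing with `len(values)-1`. The sketch below assumes a reply shaped like the structure quoted earlier in the thread (one set per worker, each holding named value lists); the type names, the `float64` value type and the `latestValue` helper are assumptions, not the generated gRPC types.

```go
package main

import "fmt"

// MetricsLog holds every logged value of one named metric (assumed shape).
type MetricsLog struct {
	Name   string
	Values []float64
}

// MetricsLogSet groups the metric logs of one worker (assumed shape).
type MetricsLogSet struct {
	WorkerID    string
	MetricsLogs []MetricsLog
}

// latestValue returns the most recent value of the named metric for a worker.
// The boolean is false when the metric is missing or has no values yet.
func latestValue(set MetricsLogSet, name string) (float64, bool) {
	for _, log := range set.MetricsLogs {
		if log.Name == name && len(log.Values) > 0 {
			return log.Values[len(log.Values)-1], true
		}
	}
	return 0, false
}

func main() {
	set := MetricsLogSet{
		WorkerID: "worker-1",
		MetricsLogs: []MetricsLog{
			{Name: "accuracy", Values: []float64{0.6, 0.8, 0.9}},
			{Name: "loss", Values: []float64{1.2, 0.7, 0.4}},
		},
	}
	value, ok := latestValue(set, "accuracy")
	fmt.Println(value, ok)
}
```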

@libbyandhelen
Contributor

@YujiOshima
I am trying to rewrite the cma-es algorithm using the new API, and I get this error:
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.UNKNOWN, sql: expected 4 destination arguments in Scan, not 3)>

Is this because of this line of code?
https://github.com/YujiOshima/hp-tuning/blob/80faafc4188557d0d1930b32abf0451fd553e0fa/pkg/db/interface.go#L848

@lluunn
Contributor

lluunn commented May 24, 2018

cc @lluunn

@YujiOshima
Contributor Author

@libbyandhelen Oh, I'm sorry for my mistake.
I will open a PR to fix it.

@jlewi
Contributor

jlewi commented Oct 9, 2018

/area 0.4.0

Can we close this issue? Is there more work to be done?

@YujiOshima
Contributor Author

This is completed.
/close

@k8s-ci-robot

@YujiOshima: Closing this issue.

