-
Notifications
You must be signed in to change notification settings - Fork 448
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add validation for NAS job in Katib controller #398
Add validation for NAS job in Katib controller #398
Conversation
pkg/controller/studyjob/utils.go
Outdated
return fmt.Errorf("Missing OutputSize in NasConfig.GraphConfig: %v", instance.Spec.NasConfig.GraphConfig) | ||
} | ||
|
||
if instance.Spec.NasConfig.GraphConfig.NumLayers == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NumLayers is not always in need for all NAS (ENAS need it). I think we need validate NAS config considering suggestion algorithm.
Please also verify if other parameters are general or only ENAS-specified
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hougangliu I think we can check these parameters exactly for Reinforcement Learning algorithm for now. For NumLayers and Step here:
katib/pkg/controller/studyjob/utils.go
Line 287 in d407be8
if jobType == jobTypeNAS && pc.Feasible.Step == "" && pc.ParameterType == katibv1alpha1.ParameterTypeDouble { |
What do you think about it?
/retest |
pkg/controller/studyjob/utils.go
Outdated
if reflect.DeepEqual(pc.Feasible, katibv1alpha1.FeasibleSpace{}) { | ||
return fmt.Errorf("Missing Feasible in ParameterConfig: %v", pc) | ||
} | ||
if pc.ParameterType == katibv1alpha1.ParameterTypeCategorical { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not only categorical but discrete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In discrete we also use only List structure or something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently discrete only need List. But we may need more type for discretes, int or double.
Please treated in the same way as Categorical for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
return validJobErr | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move validateHPJob and validateNASJob into validateStudy, and move validateStudy away from initializeStudy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You want to put validateStudy function before initializeStudy?
Right here: https://github.com/kubeflow/katib/blob/master/pkg/controller/studyjob/studyjob_controller.go#L250 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, validateStudy has been moved to https://github.com/kubeflow/katib/blob/master/pkg/controller/studyjob/validating_webhook.go#L44
pkg/controller/studyjob/utils.go
Outdated
return err | ||
} | ||
return nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return validateParameterConfigs
to replace line 215-219
return validErr | ||
} | ||
|
||
if getJobType(instance) != jobTypeNAS { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use suggestionSpec.suggestionAlgorithm to return jobTypeNAS or jobTypeHP in getJobType to handle the invalid studyjob case: .Spec.NasConfig is not nil and suggestionSpec.suggestionAlgorithm is not nas-like
pkg/controller/studyjob/const.go
Outdated
PyTorchJobWorker = "PyTorchJob" | ||
jobTypeNAS = "NAS" | ||
jobTypeHP = "HP" | ||
suggestionAlgorithmRL = "nasrl" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/suggestionAlgorithmRL/suggestionNASRL
pkg/controller/studyjob/utils.go
Outdated
return fmt.Errorf("Missing Max in ParameterConfig.Feasible: %v", pc.Feasible) | ||
} | ||
if jobType == jobTypeNAS && pc.Feasible.Step == "" && pc.ParameterType == katibv1alpha1.ParameterTypeDouble && suggestionAlgorithm == suggestionAlgorithmRL { | ||
return fmt.Errorf("Missing Step in ParameterConfig.Feasible for NAS job: %v", pc.Feasible) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
define a suggestionNASList (now only nasrl is in)
s/suggestionAlgorithm == suggestionAlgorithmRL/contains(suggestionNASList, suggestionAlgorithm)
so that when we add other nas suggestion, we only need update suggestionNASList
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. Better way to handle this?
Change GetJobType function Make a list with NAS algorithms
@hougangliu Thank you! I changed it. |
pkg/controller/studyjob/utils.go
Outdated
@@ -237,7 +251,7 @@ func validateNASJob(instance *katibv1alpha1.StudyJob) error { | |||
return fmt.Errorf("Missing OutputSize in NasConfig.GraphConfig: %v", instance.Spec.NasConfig.GraphConfig) | |||
} | |||
|
|||
if instance.Spec.NasConfig.GraphConfig.NumLayers == 0 && instance.Spec.SuggestionSpec.SuggestionAlgorithm == suggestionAlgorithmRL { | |||
if instance.Spec.NasConfig.GraphConfig.NumLayers == 0 && contains(suggestionNASList, instance.Spec.SuggestionSpec.SuggestionAlgorithm) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think line 298 can use contains(suggestionNASList,..), but here I don't think so.
If we add another nas suggestion, it may not need NumLayers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hougangliu You think we should specify this block only for suggestionNASRL
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a better place to put algorithm specific checks other than the controller? This list can grow over time with more algorithms and these checks are very specific to the algorithm implementation. @andreyvelich @hougangliu Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my point of view, it is important to check it in the controller. We don't need to run getSuggestion if something wrong with yaml file. We can't set failed condition for the StudyJob inside Suggestion. Controller do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should do algorithm-specific check in suggestion service.
- Add ValidateSuggestionParameter API to Suggestion service
or - GetSuggesiton returns a specific error code if the config is invalid then the controller handle the error code
@andreyvelich @johnugeorge @hougangliu Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@YujiOshima In that case, should we send StudyConfig to ValidateSuggestionParameters, because we don't save StudyConfig in db, if it is not valid.
We can't GetStudy like here: https://github.com/kubeflow/katib/blob/master/pkg/suggestion/nasrl_service.py#L219 ?
As well, should we put this function after this line: https://github.com/kubeflow/katib/blob/master/pkg/controller/studyjob/katib_api_util.go#L43 ?
We generate StudyConfig and can send it to Suggestion to check, whether it is valid or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich You're right. StudyConf and SuggestionParams should be included in an args of ValidateSuggestionParameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@YujiOshima Ok. In this PR, should I make a validation only for NAS RL service, or we should include BO here, as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich Only for NAS RL is enough here. If you could add validation for BO in another PR, it would be fantastic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@YujiOshima @hougangliu What do you think about name of this API: ValidateStudyParameters
?
It looks confused with ValidateSuggestionParameters
since we validate StudyConfig and SuggestionParameters in this function
pkg/controller/studyjob/const.go
Outdated
) | ||
|
||
var ValidWorkerKindList = [...]string{DefaultJobWorker, TFJobWorker, PyTorchJobWorker} | ||
|
||
var suggestionNASList = []string{suggestionNASRL} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our goal is to keep controller independent of worker type and suggestion algos. Currently there are no worker specific code in the code(except for watch. Hope, it can be fixed soon and ValidWorkerKindList won't not needed any more). Similarly i feel that there should not be any suggestion algorithm specific changes in the controller code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, how we can make a validation for the NAS job ?
Validation in NAS RL service is not ready |
I created a new API function called ValidateSuggestionParameters and added validation inside NAS RL service, as noticed here #398 (comment).
@YujiOshima @hougangliu @johnugeorge @richardsliu @DeeperMind Please, check it. /hold cancel |
/test kubeflow-katib-presubmit |
return err | ||
} | ||
|
||
log.Println("Validate Parameters inside Suggestion Start") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Start to validate Suggestion Parameters
} else { | ||
instance.Status.Condition = katibv1alpha1.ConditionFailed | ||
return errors.New("Suggestion Parameters is not valid") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/s/is/are
or Valid suggestion Parameters
and Invalid suggestion Parameters
* Return 1 if Suggestion Parameters is Valid, 0 if not | ||
*/ | ||
message ValidateSuggestionParametersReply { | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bool is_valid 1;
string msg 2;
/test kubeflow-katib-presubmit |
/retest |
3 similar comments
/retest |
/retest |
/retest |
@jlewi @YujiOshima @hougangliu @richardsliu |
/retest |
/retest |
@@ -212,10 +212,20 @@ service Manager { | |||
get: "/api/Manager/GetSavedModels" | |||
}; | |||
}; | |||
/** | |||
* Validate Suggestion Parameters from Study Job. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please explain a suggestion service should return INVALID_ARGUMENT
error when the parameter is invalid.
} | ||
|
||
conn, err := grpc.Dial("vizier-suggestion-"+in.SuggestionAlgorithm+":6789", grpc.WithInsecure()) | ||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should return specific error code here (e.g. Unavailable
) and set message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@YujiOshima I think we should delete this part:
https://github.com/kubeflow/katib/pull/398/files/ea896ae1a4c4690ef8873ce416a019e782710f9e#diff-9e63f7b0791edcb4a53c944552b88970R119.
Since you add "random" algorithm, if it is not specified:
https://github.com/kubeflow/katib/blob/master/pkg/controller/studyjob/katib_api_util.go#L28
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@YujiOshima By default this Error https://github.com/kubeflow/katib/pull/398/files#diff-9e63f7b0791edcb4a53c944552b88970R120 will return Unavailable code. I think, we don't need to specify this error code there.
Right now, I handle this error here, as you said: b88030b#diff-7a381fdaeae1295724b74c0a04f7644cR386.
Please check, if it is ok.
if statusCode.Code() == codes.Unknown { | ||
log.Printf("Method ValidateSuggestionParameters not found inside Suggestion service: %s", suggestionAlgorithm) | ||
return true | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please handle InvalidArgument
and Unavailable
error and print the reason for the error (the parameter is invalid or service is unavailable).
* Suggestion service should return INVALID_ARGUMENT Error when the parameter is invalid | ||
*/ | ||
rpc ValidateSuggestionParameters(ValidateSuggestionParametersRequest) returns (ValidateSuggestionParametersReply){ | ||
option (google.api.http) = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find that ValidateSuggestionParametersReply is used as return &api.ValidateSuggestionParametersReply{}
all the time. can we drop ValidateSuggestionParametersReply, alway return nothing?
or you can use below struct to replace current one
message ValidateSuggestionParametersReply {
bool is_valid 1;
string msg 2;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hougangliu We don't need to return anything, like here:
https://github.com/kubeflow/katib/blob/master/cmd/manager/main.go#L213.
If parameter is not valid, we return INVALID_ARGUMENT
Error message.
@YujiOshima @hougangliu Do you have any other suggestions on this PR? |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: hougangliu The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Fixes: #372.
As well, I added validation for HP job.
/assign @hougangliu @YujiOshima
This change is