
Extend gRPC controller-algorithm communication & Add Suggestion state Exhausted #1213

Closed

Conversation

elikatsis
Member

This PR targets #1122 and is a first iteration on the proposed solution.

Special notes for your reviewer:

This PR is not ready for merging since the tests have not been updated accordingly, but I will update/extend them when we get to a final iteration of this PR.

You will also notice a few commits that make the experiment accept and consume all generated suggestions. This occurs when the controller requests, say, 3 new suggestions (because it has room to start another 3 trials) but the algorithm service cannot generate that many.
I can submit it as a separate issue & PR if you like. Currently, the PR contains all the features that are required for our use cases and is mainly here to show you the full picture.
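To illustrate that scenario (a rough sketch only, not the code in this PR; the function and search-space names below are hypothetical), a grid-style algorithm can only hand out whatever combinations remain, which may be fewer than the controller asked for, and the controller should still consume all of them:

import itertools

def get_suggestions(search_space, already_suggested, request_number):
    # Enumerate every combination of the (purely categorical, for brevity)
    # search space and drop those that were already handed out.
    all_combinations = [
        dict(zip(search_space, values))
        for values in itertools.product(*search_space.values())
    ]
    remaining = [c for c in all_combinations if c not in already_suggested]
    # May return fewer than request_number assignments once the grid is
    # nearly exhausted; the controller should still start trials for all of them.
    return remaining[:request_number]

space = {"lr": [0.01, 0.1], "optimizer": ["sgd", "adam"]}
used = [{"lr": 0.01, "optimizer": "sgd"},
        {"lr": 0.01, "optimizer": "adam"},
        {"lr": 0.1, "optimizer": "sgd"}]
print(get_suggestions(space, used, 3))  # only 1 assignment is left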

Release note:

TODO: Add release-note


cc @StefanoFioravanzo @andreyvelich @gaocegege

Adding the hold label. Feel free to add another label (e.g., wip) if you feel it suits better.
/hold

Signed-off-by: Ilias Katsakioris <elikatsis@arrikto.com>
If the get suggestions RPC fails when receiving fewer assignments than the
requested amount, we lose assignments and, thus, have fewer trials to
execute.
With this commit, we log the fact that there is a difference between the
two; however, we proceed with experiment execution.

Signed-off-by: Ilias Katsakioris <elikatsis@arrikto.com>
When there is no Experiment goal, mark the Experiment as succeeded.
Otherwise, if the goal is not reached, mark it as failed.
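A minimal sketch of the decision rule these commits describe (the real logic lives in the Go controller; the function and argument names below are only illustrative):

def goal_reached(best_value, goal, objective_type):
    # For "maximize", reaching or exceeding the goal counts as success;
    # for "minimize" it is the other way around.
    if best_value is None:
        return False
    if objective_type == "maximize":
        return best_value >= goal
    return best_value <= goal

def terminal_condition(goal, best_value, objective_type="maximize"):
    # No goal declared: consuming all suggestions is itself a success.
    if goal is None:
        return "Succeeded"
    # A goal was declared but never reached: mark the Experiment as failed.
    return "Succeeded" if goal_reached(best_value, goal, objective_type) else "Failed"

print(terminal_condition(goal=None, best_value=0.87))  # Succeeded
print(terminal_condition(goal=0.99, best_value=0.87))  # Failed
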
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign johnugeorge
You can assign the PR to them by writing /assign @johnugeorge in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubeflow-bot

This change is Reviewable

Member

@andreyvelich left a comment


@elikatsis Thank you for doing this.
I think that, from now on, any new features we implement should be submitted in the v1beta1 version.

Also, as I mentioned here: #1122 (comment), can you provide any other use case where a Suggestion can come to the Exhausted state, other than in the grid algorithm?

I think it is crucial for data scientists to submit an Experiment with a correct search space.
For example, a user can define a search space with one inappropriate parameter.
The Suggestion starts to generate Trials and, at some step, produces an error, as grid did.

Because of that, ValidateAlgorithmSettings helps the user create a correct Experiment in advance.
Also, they may want to create a Pipeline of succeeded Experiments, in which case they need to know in advance that all Experiment parameters are correct.

My suggestion is to use the Exhausted/Failed state of the Suggestion when GetSuggestion can't produce new assignments because of the historical Trial results. For example, a string metric value was recorded in the ObservationLogs table and GetSuggestion failed. We can't reproduce this situation in advance and handle it in ValidateAlgorithmSettings.
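A rough sketch of what that could look like on the suggestion-service side (hypothetical servicer and request field names, not Katib's actual API): the servicer catches the bad historical metric and surfaces it as a gRPC status code instead of an unhandled exception, so the controller can move the Suggestion to Failed/Exhausted.

import grpc

class SuggestionServicer:
    def GetSuggestions(self, request, context):
        try:
            # A string value recorded in ObservationLogs makes this cast fail.
            metrics = [float(v) for v in request.metric_values]
            assignments = self._suggest(metrics, request.request_number)
        except ValueError as err:
            context.set_code(grpc.StatusCode.FAILED_PRECONDITION)
            context.set_details("cannot use historical metrics: %s" % err)
            return None  # empty reply; the controller inspects the error code
        return assignments

    def _suggest(self, metrics, n):
        # Placeholder algorithm: a real service delegates to its base_service.
        return [{"lr": 0.01}] * n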

What do you think @johnugeorge @gaocegege ?

@gaocegege
Member

@andreyvelich's suggestion SGTM.

And I also suggest moving it to v1beta1

@johnugeorge
Member

Same thought. Do we have a situation where we cannot validate the invalid configuration?

@elikatsis
Member Author

I think that, from now on, any new features we implement should be submitted in the v1beta1 version.

Ack! I can transfer the changes. It's that way because some commits were made before moving to v1beta1.

Also, as I mentioned here: #1122 (comment), can you provide any other use case where a Suggestion can come to the Exhausted state, other than in the grid algorithm?

I don't have insight into the algorithms, but I think what other algorithms do is unrelated to the bigger picture. Our argument is that other algorithms could benefit from such an approach, because it essentially enables actual gRPC communication between the controller and the service, for whatever reason, while at the same time remaining backwards compatible.

I think it is crucial for data scientists to submit an Experiment with a correct search space.
For example, a user can define a search space with one inappropriate parameter.
The Suggestion starts to generate Trials and, at some step, produces an error, as grid did.

Because of that, ValidateAlgorithmSettings helps the user create a correct Experiment in advance.
Also, they may want to create a Pipeline of succeeded Experiments, in which case they need to know in advance that all Experiment parameters are correct.

Based on our understanding, I don't really get this. How can a search space be incorrect/invalid?
Maximum number of trials is unrelated to the search space. It refers to a maximum number of jobs, so it defines an upper bound and should have nothing to do with a lower bound.
That is, if I declare a maximum of 5 trials, I'm good if my algorithm can produce 3 suggestions. I just need feedback that the experiment is now over.
As users of Katib, that's what we've seen: a data scientist thinks of max trials as a number beyond which they don't want to keep running jobs. If what they want is not satisfied (goal or no goal), they iterate and re-submit. They don't really correlate it with the possible combinations.

My suggestion is to use the Exhausted/Failed state of the Suggestion when GetSuggestion can't produce new assignments because of the historical Trial results. For example, a string metric value was recorded in the ObservationLogs table and GetSuggestion failed. We can't reproduce this situation in advance and handle it in ValidateAlgorithmSettings.

Could you elaborate on this? What do you mean by historical Trial results? Will the controller get to this point if it has already reached max trials or failed before even starting? Is it related to some other algorithm?

But, all things considered, we believe that the validation significantly degrades the UX.
With the validation PR, users will:

  1. define the experiment: i.e., select the grid algorithm, define a search space and a maximum number of trials (as explained, we've seen these are considered somewhat unrelated),
  2. submit the experiment,
  3. if validation fails, see that the maximum number of trials is in fact tied to the search space and is used as a minimum number of combinations that may run (source),
  4. count the combinations of the search space,
  5. increase max trials or change the search space & go to (4),
  6. submit the experiment,
  7. see the results (success/failure + reason)

With this PR, users will:

  1. select the grid algorithm, define a search space and a maximum number of trials (as explained, we've seen these are considered somewhat unrelated),
  2. submit the experiment,
  3. see the results (success/failure + reason)

@andreyvelich
Member

Based on our understanding, I don't really get this. How can a search space be incorrect/invalid?

For example, for NAS algorithms you can have only certain operations in the search space: https://github.com/kubeflow/katib/blob/master/examples/v1beta1/nas/darts-cnn-cifar10/operations.py#L4-L17.
Another example: for Grid you must specify a step for double parameters: https://github.com/kubeflow/katib/blob/master/examples/v1alpha3/grid-example.yaml#L26, but for BO it is not necessary (https://github.com/kubeflow/katib/blob/master/examples/v1alpha3/bayesianoptimization-example.yaml#L24-L28).
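To make the grid case concrete (a hedged sketch; the parameter values below are illustrative, not Katib's spec): without a step, a double parameter has no finite set of grid points, and the steps together with the ranges determine how many combinations exist at all, which is what validation can compare against maxTrialCount.

def grid_points(p_min, p_max, step=None):
    # A double parameter needs a step to yield a finite list of grid values.
    if step is None:
        raise ValueError("double parameter requires a step for grid search")
    values, v = [], p_min
    while v <= p_max + 1e-12:
        values.append(round(v, 10))
        v += step
    return values

lr_values = grid_points(0.01, 0.05, step=0.01)      # 5 values
momentum_values = grid_points(0.5, 0.9, step=0.2)   # 3 values
total_combinations = len(lr_values) * len(momentum_values)
print(total_combinations)  # 15: the grid can produce at most this many distinct trials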

Maximum number of trials is unrelated to the search space. It refers to a maximum number of jobs, so it defines an upper bound and should have nothing to do with a lower bound. That is, if I declare a maximum of 5 trials, I'm good if my algorithm can produce 3 suggestions. I just need feedback that the experiment is now over.

I believe that for some algorithms (hyperband, grid) it is important to specify the correct total number of jobs or number of parallel jobs, right @gaocegege? Because of that, we validate it beforehand.

Could you elaborate on this? What do you mean by historical Trial results? Will the controller get to this point if it has already reached max trials or failed before even starting? Is it related to some other algorithm?

As you can see here: https://github.com/kubeflow/katib/blob/master/pkg/suggestion/v1beta1/skopt/base_service.py#L86, we analyse the Trial results and convert the loss value to float. If, for some reason, the metrics collector reports non-float metrics to the DB, this cast will fail and GetSuggestion will fail as well. Even if the metrics are not floats, we still send them to the Suggestion: https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/suggestion/suggestionclient/suggestionclient.go#L265.
We can't reproduce this situation in advance in ValidateAlgorithmSettings, so we can handle this case with your implementation:

try:
    new_assignments = self.base_service.getSuggestions(
        trials, request.request_number)
except StopIteration as e:
    return self._set_get_suggestions_context_error(
        context, grpc.StatusCode.NOT_FOUND, str(e))
return api_pb2.GetSuggestionsReply(
    parameter_assignments=Assignment.generate(new_assignments)
)
and convert the Suggestion to a Failed/Exhausted state. I prefer the Failed state in this situation.
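On the calling side the contract would then be to inspect the status code rather than treating every RPC error the same way. The real caller is the Go controller; the Python below is only an illustration of that contract, with a hypothetical stub:

import grpc

def request_suggestions(stub, request):
    try:
        return stub.GetSuggestions(request), None
    except grpc.RpcError as err:
        if err.code() == grpc.StatusCode.NOT_FOUND:
            # The algorithm can never produce more assignments: Exhausted.
            return None, "Exhausted"
        # Any other error (e.g. unparsable historical metrics): Failed.
        return None, "Failed"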

But, all things considered, we believe that the validation significantly degrades the UX.
With the validation PR, users will:

  1. define the experiment: i.e., select the grid algorithm, define a search space and a maximum number of trials (as explained, we've seen these are considered somewhat unrelated),
  2. submit the experiment,
  3. if validation fails, see that the maximum number of trials is in fact tied to the search space and is used as a minimum number of combinations that may run (source),
  4. count the combinations of the search space,
  5. increase max trials or change the search space & go to (4),

Users don't need to count combinations, because we provide them with the minimal number in the Failed Suggestion message: https://github.com/kubeflow/katib/pull/1205/files#diff-c302af83da7afa577c1be59746ddb989R47.

As I said before, I think we should validate as many Experiment parameters as we can. That can help users better understand how AutoML algorithms work and how to submit a correct Experiment, without deep research into the Katib docs and source code.

I agree that we need a better gRPC connection between the Suggestion and the Katib controller.
Right now, if a Suggestion fails for some reason, the controller will call this Suggestion again and again.
I am not sure that is the correct approach.

We should think about how we can handle this situation. I am fine with the approach of setting a gRPC code, as you did:

return self._set_get_suggestions_context_error(
    context, grpc.StatusCode.NOT_FOUND, str(e))

But this approach should work for any new algorithm that Katib's contributors want to implement, and it should be very clear to contributors how to use this feature in a new algorithm (https://github.com/kubeflow/katib/blob/master/docs/new-algorithm-service.md).

Any thoughts @johnugeorge @gaocegege ?

@gaocegege
Member

I believe that for some algorithms (hyperband, grid) it is important to specify the correct total number of jobs or number of parallel jobs, right @gaocegege? Because of that, we validate it beforehand.

Yes, grid search, for example, requires the validation.

@aws-kf-ci-bot
Contributor

@elikatsis: The following tests failed, say /retest to rerun all failed tests:

Test name                      Commit   Details  Rerun command
kubeflow-katib-presubmit-e2e   697790f  link     /test kubeflow-katib-presubmit-e2e
kubeflow-katib-presubmit       697790f  link     /test kubeflow-katib-presubmit

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@stale

stale bot commented Jun 11, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale

stale bot commented Jul 8, 2021

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

@stale stale bot closed this Jul 8, 2021