
Extend gRPC controller-algorithm communication & Add Suggestion state Exhausted #1213

Closed

Conversation

elikatsis
Member

This PR targets #1122 and is a first iteration on the proposed solution.

Special notes for your reviewer:

This PR is not ready for merging since the tests have not been updated accordingly, but I will update/extend them when we get to a final iteration of this PR.

You will also notice a few commits that make the experiment accept and consume all generated suggestions. This occurs when the controller requests, say, 3 new suggestions (because it has room to start another 3 trials) but the algorithm service cannot generate that many.
I can submit it as a separate issue & PR if you like. Currently, the PR contains all the features that are required for our use cases and is mainly here to show you the full picture.
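To illustrate that scenario (a rough sketch only, not the code in this PR; the function and search-space names below are hypothetical), a grid-style algorithm can only hand out whatever combinations remain, which may be fewer than the controller asked for, and the controller should still consume all of them:

import itertools

def get_suggestions(search_space, already_suggested, request_number):
    # Enumerate every combination of the (purely categorical, for brevity)
    # search space and drop those that were already handed out.
    all_combinations = [
        dict(zip(search_space, values))
        for values in itertools.product(*search_space.values())
    ]
    remaining = [c for c in all_combinations if c not in already_suggested]
    # May return fewer than request_number assignments once the grid is
    # nearly exhausted; the controller should still start trials for all of them.
    return remaining[:request_number]

space = {"lr": [0.01, 0.1], "optimizer": ["sgd", "adam"]}
used = [{"lr": 0.01, "optimizer": "sgd"},
        {"lr": 0.01, "optimizer": "adam"},
        {"lr": 0.1, "optimizer": "sgd"}]
print(get_suggestions(space, used, 3))  # only 1 assignment is left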

Release note:

TODO: Add release-note


cc @StefanoFioravanzo @andreyvelich @gaocegege

Adding the hold label. Feel free to add another label (e.g., wip) if you feel it suits better.
/hold

Signed-off-by: Ilias Katsakioris <elikatsis@arrikto.com>
If the get suggestions RPC fails when receiving fewer assignments than the
requested amount, we lose assignments and, thus, have fewer trials to
execute.
With this commit, we log the fact that there is a difference between the
two; however, we proceed with experiment execution.

Signed-off-by: Ilias Katsakioris <elikatsis@arrikto.com>
When there is no Experiment goal, mark the Experiment as succeeded.
Otherwise, if the goal is not reached, mark it as failed.
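A minimal sketch of the decision rule these commits describe (the real logic lives in the Go controller; the function and argument names below are only illustrative):

def goal_reached(best_value, goal, objective_type):
    # For "maximize", reaching or exceeding the goal counts as success;
    # for "minimize" it is the other way around.
    if best_value is None:
        return False
    if objective_type == "maximize":
        return best_value >= goal
    return best_value <= goal

def terminal_condition(goal, best_value, objective_type="maximize"):
    # No goal declared: consuming all suggestions is itself a success.
    if goal is None:
        return "Succeeded"
    # A goal was declared but never reached: mark the Experiment as failed.
    return "Succeeded" if goal_reached(best_value, goal, objective_type) else "Failed"

print(terminal_condition(goal=None, best_value=0.87))  # Succeeded
print(terminal_condition(goal=0.99, best_value=0.87))  # Failed
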
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign johnugeorge
You can assign the PR to them by writing /assign @johnugeorge in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubeflow-bot

This change is Reviewable

Member

@andreyvelich left a comment


@elikatsis Thank you for doing this.
I think that, from now on, any new features we implement should be submitted in the v1beta1 version.

Also, as I mentioned here: #1122 (comment), can you provide any other use case where a Suggestion can come to the Exhausted state, other than in the grid algorithm?

I think it is crucial for data scientists to submit an Experiment with a correct search space.
For example, a user can define a search space with one inappropriate parameter.
The Suggestion starts to generate Trials and, at some step, produces an error, as grid did.

Because of that, ValidateAlgorithmSettings helps the user create a correct Experiment in advance.
Also, they may want to create a Pipeline of succeeded Experiments, in which case they need to know in advance that all Experiment parameters are correct.

My suggestion is to use the Exhausted/Failed state of the Suggestion when GetSuggestion can't produce new assignments because of the historical Trial results. For example, a string metric value was recorded in the ObservationLogs table and GetSuggestion failed. We can't reproduce this situation in advance and handle it in ValidateAlgorithmSettings.
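A rough sketch of what that could look like on the suggestion-service side (hypothetical servicer and request field names, not Katib's actual API): the servicer catches the bad historical metric and surfaces it as a gRPC status code instead of an unhandled exception, so the controller can move the Suggestion to Failed/Exhausted.

import grpc

class SuggestionServicer:
    def GetSuggestions(self, request, context):
        try:
            # A string value recorded in ObservationLogs makes this cast fail.
            metrics = [float(v) for v in request.metric_values]
            assignments = self._suggest(metrics, request.request_number)
        except ValueError as err:
            context.set_code(grpc.StatusCode.FAILED_PRECONDITION)
            context.set_details("cannot use historical metrics: %s" % err)
            return None  # empty reply; the controller inspects the error code
        return assignments

    def _suggest(self, metrics, n):
        # Placeholder algorithm: a real service delegates to its base_service.
        return [{"lr": 0.01}] * n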

What do you think @johnugeorge @gaocegege ?

@gaocegege
Member

@andreyvelich's suggestion SGTM.

And I also suggest moving it to v1beta1

@johnugeorge
Member

Same thought. Do we have a situation where we cannot validate the invalid configuration?

@elikatsis
Member Author

I think that, from now on, any new features we implement should be submitted in the v1beta1 version.

Ack! I can transfer the changes. It's that way because some commits were made before moving to v1beta1.

Also, as I mentioned here: #1122 (comment), can you provide any other use case where a Suggestion can come to the Exhausted state, other than in the grid algorithm?

I don't have insight into the algorithms, but I think what other algorithms do is unrelated to the bigger picture. Our argument is that other algorithms could benefit from such an approach, because it essentially enables actual gRPC communication between the controller and the service, for whatever reason, while at the same time remaining backwards compatible.

I think it is crucial for data scientists to submit an Experiment with a correct search space.
For example, a user can define a search space with one inappropriate parameter.
The Suggestion starts to generate Trials and, at some step, produces an error, as grid did.

Because of that, ValidateAlgorithmSettings helps the user create a correct Experiment in advance.
Also, they may want to create a Pipeline of succeeded Experiments, in which case they need to know in advance that all Experiment parameters are correct.

Based on our understanding, I don't really get this. How can a search space be incorrect/invalid?
Maximum number of trials is unrelated to the search space. It refers to a maximum number of jobs, so it defines an upper bound and should have nothing to do with a lower bound.
That is, if I declare a maximum of 5 trials, I'm good if my algorithm can produce 3 suggestions. I just need feedback that the experiment is now over.
As users of Katib, that's what we've seen: a data scientist thinks of max trials as a number beyond which they don't want to keep running jobs. If what they want is not satisfied (goal or no goal), they iterate and re-submit. They don't really correlate it with the possible combinations.

My suggestion is to use the Exhausted/Failed state of the Suggestion when GetSuggestion can't produce new assignments because of the historical Trial results. For example, a string metric value was recorded in the ObservationLogs table and GetSuggestion failed. We can't reproduce this situation in advance and handle it in ValidateAlgorithmSettings.

Could you elaborate on this? What do you mean by historical Trial results? Will the controller get to this point if it has already reached max trials or failed before even starting? Is it related to some other algorithm?

But, all things considered, we believe that the validation significantly degrades the UX.
With the validation PR, users will:

  1. define the experiment: i.e., select the grid algorithm, define a search space and a maximum number of trials (as explained, we've seen these are considered somewhat unrelated),
  2. submit the experiment,
  3. if validation fails, see that the maximum number of trials is in fact tied to the search space and is used as a minimum number of combinations that may run (source),
  4. count the combinations of the search space,
  5. increase max trials or change the search space & go to (4),
  6. submit the experiment,
  7. see the results (success/failure + reason)

With this PR, users will:

  1. select the grid algorithm, define a search space and a maximum number of trials (as explained, we've seen these are considered somewhat unrelated),
  2. submit the experiment,
  3. see the results (success/failure + reason)

@andreyvelich
Member

Based on our understanding, I don't really get this. How can a search space be incorrect/invalid?

For example, for NAS algorithms you can have only certain operations in the search space: https://github.com/kubeflow/katib/blob/master/examples/v1beta1/nas/darts-cnn-cifar10/operations.py#L4-L17.
Another example: for Grid you must specify a step for double parameters: https://github.com/kubeflow/katib/blob/master/examples/v1alpha3/grid-example.yaml#L26, but for BO it is not necessary (https://github.com/kubeflow/katib/blob/master/examples/v1alpha3/bayesianoptimization-example.yaml#L24-L28).
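To make the grid case concrete (a hedged sketch; the parameter values below are illustrative, not Katib's spec): without a step, a double parameter has no finite set of grid points, and the steps together with the ranges determine how many combinations exist at all, which is what validation can compare against maxTrialCount.

def grid_points(p_min, p_max, step=None):
    # A double parameter needs a step to yield a finite list of grid values.
    if step is None:
        raise ValueError("double parameter requires a step for grid search")
    values, v = [], p_min
    while v <= p_max + 1e-12:
        values.append(round(v, 10))
        v += step
    return values

lr_values = grid_points(0.01, 0.05, step=0.01)      # 5 values
momentum_values = grid_points(0.5, 0.9, step=0.2)   # 3 values
total_combinations = len(lr_values) * len(momentum_values)
print(total_combinations)  # 15: the grid can produce at most this many distinct trials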

Maximum number of trials is unrelated to the search space. It refers to a maximum number of jobs, so it defines an upper bound and should have nothing to do with a lower bound. That is, if I declare a maximum of 5 trials, I'm good if my algorithm can produce 3 suggestions. I just need feedback that the experiment is now over.

I believe that for some algorithms (hyperband, grid) it is important to specify the correct total number of jobs or number of parallel jobs, right @gaocegege? Because of that, we validate it beforehand.

Could you elaborate on this? What do you mean by historical Trial results? Will the controller get to this point if it has already reached max trials or failed before even starting? Is it related to some other algorithm?

As you can see here: https://github.com/kubeflow/katib/blob/master/pkg/suggestion/v1beta1/skopt/base_service.py#L86, we analyse the Trial results and convert the loss value to float. If, for some reason, the metrics collector reports non-float metrics to the DB, this cast will fail and GetSuggestion will fail as well. Even if the metrics are not floats, we still send them to the Suggestion: https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/suggestion/suggestionclient/suggestionclient.go#L265.
We can't reproduce this situation in advance in ValidateAlgorithmSettings, so we can handle this case with your implementation:

try:
    new_assignments = self.base_service.getSuggestions(
        trials, request.request_number)
except StopIteration as e:
    return self._set_get_suggestions_context_error(
        context, grpc.StatusCode.NOT_FOUND, str(e))
return api_pb2.GetSuggestionsReply(
    parameter_assignments=Assignment.generate(new_assignments)
)
and convert the Suggestion to a Failed/Exhausted state. I prefer the Failed state in this situation.
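On the calling side the contract would then be to inspect the status code rather than treating every RPC error the same way. The real caller is the Go controller; the Python below is only an illustration of that contract, with a hypothetical stub:

import grpc

def request_suggestions(stub, request):
    try:
        return stub.GetSuggestions(request), None
    except grpc.RpcError as err:
        if err.code() == grpc.StatusCode.NOT_FOUND:
            # The algorithm can never produce more assignments: Exhausted.
            return None, "Exhausted"
        # Any other error (e.g. unparsable historical metrics): Failed.
        return None, "Failed"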

But, all things considered, we believe that the validation significantly degrades the UX.
With the validation PR, users will:

  1. define the experiment: i.e., select the grid algorithm, define a search space and a maximum number of trials (as explained, we've seen these are considered somewhat unrelated),
  2. submit the experiment,
  3. if validation fails, see that the maximum number of trials is in fact tied to the search space and is used as a minimum number of combinations that may run (source),
  4. count the combinations of the search space,
  5. increase max trials or change the search space & go to (4),

Users don't need to count combinations, because we provide them with the minimal number in the Failed Suggestion message: https://github.com/kubeflow/katib/pull/1205/files#diff-c302af83da7afa577c1be59746ddb989R47.

As I said before, I think we should validate as many Experiment parameters as we can. That can help users better understand how AutoML algorithms work and how to submit a correct Experiment, without deep research into the Katib docs and source code.

I agree that we need a better gRPC connection between the Suggestion and the Katib controller.
Right now, if a Suggestion fails for some reason, the controller will call this Suggestion again and again.
I am not sure that is the correct approach.

We should think about how we can handle this situation. I am fine with the approach of setting a gRPC code, as you did:

return self._set_get_suggestions_context_error(
    context, grpc.StatusCode.NOT_FOUND, str(e))

But this approach should work for any new algorithm that Katib's contributors want to implement, and it should be very clear to contributors how to use this feature in a new algorithm (https://github.com/kubeflow/katib/blob/master/docs/new-algorithm-service.md).

Any thoughts @johnugeorge @gaocegege ?

@gaocegege
Member

I believe that for some algorithms (hyperband, grid) it is important to specify the correct total number of jobs or number of parallel jobs, right @gaocegege? Because of that, we validate it beforehand.

Yes, grid search, for example, requires the validation.

@aws-kf-ci-bot
Contributor

@elikatsis: The following tests failed, say /retest to rerun all failed tests:

Test name                      Commit   Details  Rerun command
kubeflow-katib-presubmit-e2e   697790f  link     /test kubeflow-katib-presubmit-e2e
kubeflow-katib-presubmit       697790f  link     /test kubeflow-katib-presubmit

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@stale

stale bot commented Jun 11, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale

stale bot commented Jul 8, 2021

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

@stale stale bot closed this Jul 8, 2021