-
Notifications
You must be signed in to change notification settings - Fork 442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extend gRPC controller-algorithm communication & Add Suggestion state Exhausted #1213
Conversation
Signed-off-by: Ilias Katsakioris <elikatsis@arrikto.com>
Signed-off-by: Ilias Katsakioris <elikatsis@arrikto.com>
Signed-off-by: Ilias Katsakioris <elikatsis@arrikto.com>
If get suggestions RPC fails when receiving less assignments than the requested amount, we lose assignments and, thus, have less trials to execute. With this commit, we log the fact that there is a difference between the two, however we proceed with experiment execution. Signed-off-by: Ilias Katsakioris <elikatsis@arrikto.com>
Signed-off-by: Ilias Katsakioris <elikatsis@arrikto.com>
Signed-off-by: Ilias Katsakioris <elikatsis@arrikto.com>
When there is no Experiment goal, mark the Experiment as succeeded. Else, if the goal is not reached, mark it as failed.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@elikatsis Thank you for doing this.
I think from now any new features that we implement, we should submit in v1beta1 version.
Also, as I mentioned here: #1122 (comment), can you provide any other use-case when Suggestion can come to Exhausted
state not in grid
algorithm?
I think it is crucial for data scientists to submit Experiment with the correct search space.
For example: user can define search space with one inappropriate parameter.
Suggestion starts to generate Trials and on x-step Suggestion generate error, like grid
did.
Because of that, ValidateAlgorithmSettings
is helping user to create correct Experiment in advance.
Also, maybe they want to create a Pipeline of succeeded Experiments and they need to know in advance that all Experiment parameters are correct.
My suggestion is to use Exhausted/Failed state of Suggestion, when GetSuggestion
can't produce new assignment because of the historical Trials results. For example, in ObservationLogs
table string
metric value was recorded and GetSuggestion
failed. We can't reproduce this situation in advance and handle it in ValidateAlgorithmSettings
.
What do you think @johnugeorge @gaocegege ?
@andreyvelich 's suggestion SGTM. And I also suggest moving it to v1beta1 |
Same thought. Do we have a situation where we cannot validate the invalid configuration? |
Ack! I can transfer the changes. It's that way because some commits were made before moving to v1beta1.
I don't have insight on the algorithms, but I think what other algorithms do is unrelated to the bigger picture. Our argument is that other algorithms could benefit from such an approach, because it essentially enables actual gRPC communication between the controller and the service, for whatever reason, and in the same time being backwards compatible.
I don't really get this based on our understanding. How can a search space be incorrect/invalid?
Could you elaborate on this? What do you mean by historical Trials results? Will the controller get to this point if it has already reached max trials or failed before even starting? Is it related to some other algorithm? But, all things considered, we believe that the validation significantly degrades the UX.
With this PR users will:
|
For example, for NAS algorithms you can have only certain operations in search space: https://github.com/kubeflow/katib/blob/master/examples/v1beta1/nas/darts-cnn-cifar10/operations.py#L4-L17.
I believe for some algorithms (hyperband, grid) it is important to specify correct number total jobs or number of parallel jobs, right @gaocegege ? Because of that, we validate it beforehand.
As you can see here: https://github.com/kubeflow/katib/blob/master/pkg/suggestion/v1beta1/skopt/base_service.py#L86, we analyse trial results and convert katib/pkg/suggestion/v1alpha3/chocolate/service.py Lines 48 to 56 in 697790f
Failed/Exhausted state. I prefer Failed state in this situation.
Users don't need to count combination, because we provide them minimal number in Failed Suggestion message: https://github.com/kubeflow/katib/pull/1205/files#diff-c302af83da7afa577c1be59746ddb989R47. As I said before, I think we should Validate Experiment parameters as many as we can. That can help user to better understand how AutoML algorithms work and how to submit correct Experiment for it. Without deep research of Katib docs and source code. I agree that we need better gRPC connection between Suggestion and Katib controller. We should think how we can handle this situation, I am fine with approach of set gRPC code, like you did: katib/pkg/suggestion/v1alpha3/chocolate/service.py Lines 52 to 53 in 697790f
But this approach should work for any new algorithm that Katib's contributors want to implement. And that should be very clear when contributors want to use this feature in new algorithm:(https://github.com/kubeflow/katib/blob/master/docs/new-algorithm-service.md). Any thoughts @johnugeorge @gaocegege ? |
Yes, grid search, for example, requires the validation. |
@elikatsis: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it. |
This PR targets #1122 and is a first iteration on the proposed solution.
Special notes for your reviewer:
This PR is not ready for merging since tests are not updated accordingly. But I will update/extend tests when we get to a final iteration of this PR.
You will also notice a few commits for the experiment to accept and consume all generated suggestions. This occurs when the controller requests, let's say 3 new suggestions (because it has the room to start another 3 trials) but the algorithm service cannot generate that many.
I can submit it as a separate issue & PR if you like. Currently, the PR contains all the features that are required for our use cases and is mainly here to show you the full picture.
Release note:
TODO: Add release-note
cc @StefanoFioravanzo @andreyvelich @gaocegege
Adding the
hold
label. Feel free to add another label (e.g.,wip
) if you feel it suits better./hold