add CMA-ES algorithm #67
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: Assign the PR to them by writing The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/assign @gaocegege @YujiOshima |
/ok-to-test Is it WIP or finished? |
@libbyandhelen Please dockerize and add kubernetes manifest. |
Hi, we refactored the structure of the code base, please rebase master :-) |
@YujiOshima OK, will do. |
@YujiOshima |
_ONE_DAY_IN_SECONDS = 60 * 60 * 24


class ManagerService(api_pb2_grpc.ManagerServicer):
Why do you implement a new ManagerService?
You should use this manager.
Yes, I am using the manager implemented by you. This one is just for (unit) testing.
Speaking of testing, it seems that I need to follow getting-start.md to deploy Katib on a Kubernetes cluster. Otherwise, it will get stuck on the steps that depend on Kubernetes.
OK, I understand! Thanks.
pkg/api/python/api.proto
Outdated
@@ -0,0 +1,407 @@
syntax = "proto3";
You should use the existing proto file. https://github.com/kubeflow/katib/blob/master/pkg/api/api.proto
Yes, the current api_pb2.py and api_pb2_grpc.py are generated from the original proto file mentioned above. Let me delete the unused one :)
        channel = grpc.insecure_channel(MANAGER)
        self.stub = api_pb2_grpc.ManagerStub(channel)

    def GetSuggestions(self, request, context):
What happens when you call GetSuggestions before the previous Trials are completed?
If the previous trials are not complete, I assume that no metrics_log_sets will be returned when I call GetMetrics. In that case I will return a gRPC error saying that all the trials in the previous population must be evaluated first, like this:
https://github.com/libbyandhelen/hp-tuning-1/blob/b98dccb5cbc70e85004c848cc7cd7ea403a39577/pkg/suggestion/cma_service.py#L95
Is my assumption correct?
No, GetMetrics returns the current results even when the worker is not completed.
For example, suppose a worker has not completed yet and its logs look like this:
epoch 1 accuracy=0.1
epoch 2 accuracy=0.3
epoch 3 accuracy=0.4
GetMetrics will return this:
workerId: example
MetricsLog:
Name: accuracy
Values:
- 0.1
- 0.3
- 0.4
Then the worker completes, and its logs look like this:
epoch 1 accuracy=0.1
epoch 2 accuracy=0.3
epoch 3 accuracy=0.4
epoch 4 accuracy=0.6
epoch 5 accuracy=0.8
test_accuracy=0.7
Completed!!
GetMetrics will return this:
workerId: example
MetricsLogs:
Name: accuracy
Values:
- 0.1
- 0.3
- 0.4
- 0.6
- 0.8
Name: test_accuracy
Values:
- 0.7
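The mapping from worker logs to the replies above can be sketched in Python. This is a minimal illustration, not Katib's actual metrics collection: `parse_metrics` and the regular expression are hypothetical, and the sketch assumes each log line carries at most one `name=value` pair.

```python
import re
from collections import OrderedDict

# Hypothetical pattern: a metric name, '=', then a decimal number.
_METRIC = re.compile(r"([A-Za-z_]\w*)=([-+]?\d+(?:\.\d+)?)")


def parse_metrics(log_text):
    """Group every name=value pair found in the log by metric name, in log order."""
    metrics = OrderedDict()
    for line in log_text.splitlines():
        match = _METRIC.search(line)
        if match is None:
            continue  # lines such as "Completed!!" carry no metric
        name, value = match.group(1), float(match.group(2))
        metrics.setdefault(name, []).append(value)
    return metrics


# Logs of a worker that is still running, then of the same worker once completed.
running = "epoch 1 accuracy=0.1\nepoch 2 accuracy=0.3\nepoch 3 accuracy=0.4"
completed = (running + "\nepoch 4 accuracy=0.6\nepoch 5 accuracy=0.8"
             "\ntest_accuracy=0.7\nCompleted!!")

print(dict(parse_metrics(running)))
print(dict(parse_metrics(completed)))
```

The point of the sketch is that a partial log already yields a non-empty result, so an empty reply cannot be used to detect unfinished workers.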
@YujiOshima
Then can I check the status of the worker after calling the GetWorkers method to see whether the worker is completed?
From your code, I see that after the client calls RunTrial, it creates a worker in the db (Pending state) and spawns a worker (Running state), but where is the code that updates the worker state to completed after the evaluation?
@libbyandhelen When you call GetMetrics, the status of the workers is updated.
So you can get the latest status by calling GetWorker after GetMetrics.
If you find that troublesome, we can add a status item to GetMetricsReply.
WDYT?
@YujiOshima
Thanks! The additional status item may be more natural for me. Could you please add it?
@YujiOshima |
@libbyandhelen That is a somewhat difficult problem.
Random and grid are the first type. They only need the request number and can return parameter lists regardless of the state of other workers.
BO is the second type. It needs to wait for the workers of the past suggestion to complete, so it requires two kinds of return: parameter lists and an error (the past workers are not completed).
HyperBand is the third type. It also needs the past workers to complete, and it has an explicit end of the algorithm. It needs three kinds of return: parameter lists, an error (the past workers are not completed), and a signal that the algorithm is completed.
There are two ways to solve the problem.
Handling this through the type of error looks simple, but I don't know whether that is possible across Go and Python or other languages.
@YujiOshima |
@libbyandhelen Thanks! It looks simpler to use error codes.
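The three kinds of return discussed in this thread could be modelled as a status code carried next to the parameter lists. The sketch below is only an illustration under stated assumptions: `SuggestionStatus`, `SuggestionResult`, `get_suggestions` and the "Completed" state string are hypothetical names, not part of Katib's API, and a real service would map these onto gRPC status codes instead of a Python enum.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class SuggestionStatus(Enum):
    """Hypothetical status codes covering the three algorithm types."""
    OK = 0                              # parameter lists are returned
    PREVIOUS_WORKERS_NOT_COMPLETED = 1  # BO, HyperBand, CMA-ES must wait
    ALGORITHM_COMPLETED = 2             # e.g. HyperBand has an explicit end


@dataclass
class SuggestionResult:
    status: SuggestionStatus
    parameters: List[Dict[str, float]] = field(default_factory=list)


def get_suggestions(worker_states: List[str], generation: int,
                    max_generations: int,
                    propose: Callable[[], List[Dict[str, float]]]) -> SuggestionResult:
    """Return new parameter lists only when the previous workers are all done."""
    if any(state != "Completed" for state in worker_states):
        return SuggestionResult(SuggestionStatus.PREVIOUS_WORKERS_NOT_COMPLETED)
    if generation >= max_generations:
        return SuggestionResult(SuggestionStatus.ALGORITHM_COMPLETED)
    return SuggestionResult(SuggestionStatus.OK, propose())


def propose_once() -> List[Dict[str, float]]:
    # Stand-in for a real sampler; the parameter name is made up.
    return [{"learning_rate": 0.01}]


print(get_suggestions(["Completed", "Running"], 0, 5, propose_once).status)
print(get_suggestions(["Completed", "Completed"], 0, 5, propose_once).status)
print(get_suggestions(["Completed", "Completed"], 5, 5, propose_once).status)
```

Random and grid search would always take the last branch, while an algorithm such as the one in this PR would hit the first branch until every trial of the previous population has been evaluated.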
@YujiOshima |
@@ -39,3 +39,6 @@ cd ${GO_DIR}

cp cmd/suggestion/bayesianoptimization/Dockerfile .
gcloud container builds submit . --tag=${REGISTRY}/${REPO_NAME}/suggestion-bayesianoptimization:${VERSION} --project=${PROJECT}

cp cmd/suggestion/cma/Dockerfile .
Please create a new script build-suggestion-cma.sh and register it in /test/workflows/components/workflows.libsonnet.
@@ -0,0 +1,81 @@
import grpc
Please add a unit test for CMA to CI (/test/scripts/run-tests.sh).
@libbyandhelen Could you add an e2e test for CMA? |
@libbyandhelen This looks like a great contribution. I know it's a lot of work, but if you could update the tests and make the other changes, it would be great to get this added. |
@libbyandhelen Any plans to try to push this forward? |
/assign @johnugeorge |
/assign @xyhuang |
Hi! Can I work on this issue? |
@c-bata |
This PR is obsolete now, as the interfaces are no longer valid.
Hi, I guess we can close this because #1131 is merged. |
@c-bata Sure. |
@andreyvelich: Closed this PR.
This PR adds an algorithm called Covariance Matrix Adaptation Evolution Strategy (CMA-ES), which can compete with or even beat Bayesian optimization in the long run. It is better suited to cases with continuous parameters and a larger budget.
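To make the population-based loop concrete, here is a deliberately simplified evolution strategy in Python. It is not full CMA-ES and not the code from this PR: it adapts only a per-coordinate (diagonal) spread instead of a full covariance matrix, it omits evolution paths and step-size control, and all names (`simple_es`, `sphere`, the default hyperparameters) are hypothetical. It does show why every trial of a generation has to be evaluated before the next generation can be sampled.

```python
import random


def sphere(x):
    """Toy objective to minimize: sum of squares, optimum at the origin."""
    return sum(v * v for v in x)


def simple_es(objective, x0, sigma0=1.0, population=12, parents=4,
              generations=40, seed=0):
    """Simplified (mu, lambda) evolution strategy with a diagonal spread."""
    rng = random.Random(seed)
    dim = len(x0)
    mean = list(x0)
    sigma = [sigma0] * dim  # per-coordinate standard deviation
    best_x, best_f = list(x0), objective(x0)

    for _ in range(generations):
        # "Ask": sample a whole generation around the current mean.
        candidates = [[rng.gauss(mean[i], sigma[i]) for i in range(dim)]
                      for _ in range(population)]
        # "Tell": every candidate must be evaluated before the update.
        scored = sorted(candidates, key=objective)
        elite = scored[:parents]
        elite_f = objective(elite[0])
        if elite_f < best_f:
            best_x, best_f = list(elite[0]), elite_f
        # Re-estimate the mean and the spread from the best candidates.
        # The spread is measured around the old mean, so it also reflects
        # how far the population moved in this generation.
        new_mean = [sum(x[i] for x in elite) / parents for i in range(dim)]
        for i in range(dim):
            var = sum((x[i] - mean[i]) ** 2 for x in elite) / parents
            sigma[i] = max(var ** 0.5, 1e-12)
        mean = new_mean
    return best_x, best_f


start = [3.0, -2.0]
best_x, best_f = simple_es(sphere, start)
print("start value:", sphere(start), "best value found:", best_f)
```

In a suggestion service, the "ask" step would correspond to returning a population of trials, and the "tell" step to reading their final metrics, which is why the service needs a reliable way to know that all workers of the previous population have completed.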