add CMA-ES algorithm #67

libbyandhelen · 2018-04-20T22:53:45Z

This PR adds the Covariance Matrix Adaptation Evolution Strategy (CMA-ES), which can compete with or even beat Bayesian optimization in the long run. It is better suited to cases with continuous parameters and a larger budget.

k8s-ci-robot · 2018-04-20T22:53:47Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: gaocegege

Assign the PR to them by writing /assign @gaocegege in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

libbyandhelen · 2018-04-20T22:56:46Z

/assign @gaocegege @YujiOshima

gaocegege · 2018-04-22T03:10:15Z

/ok-to-test

Is it WIP or finished?

YujiOshima · 2018-04-22T04:27:14Z

@libbyandhelen Please dockerize this and add a Kubernetes manifest.
If possible, please also add a unit test for this suggestion service to the test script.

gaocegege · 2018-04-22T06:01:39Z

Hi, we refactored the structure of the code base. Please rebase onto master :-)

libbyandhelen · 2018-04-23T01:22:40Z

@YujiOshima OK, will do.
@gaocegege This is still in progress. I will remove "WIP" from the title when it is finished.

libbyandhelen · 2018-05-25T04:03:35Z

@YujiOshima
please take a look:)

YujiOshima · 2018-05-29T10:57:55Z

pkg/suggestion/cma_unit_test/main.py

+_ONE_DAY_IN_SECONDS = 60 * 60 * 24
+
+
+class ManagerService(api_pb2_grpc.ManagerServicer):


Why do you implement a new ManagerService?
You should use this manager.

Yes, I am using the manager implemented by you. This one is just for (unit) testing.
Speaking of testing, it seems that I need to follow getting-start.md to deploy Katib on a k8s cluster. Otherwise, the test will get stuck on the steps that involve Kubernetes.

OK, I understand! Thanks.

YujiOshima · 2018-05-29T11:00:19Z

pkg/api/python/api.proto

@@ -0,0 +1,407 @@
+syntax = "proto3";


You should use the existing proto file. https://github.com/kubeflow/katib/blob/master/pkg/api/api.proto

Yes, the current api_pb2.py and api_pb2_grpc.py are generated from the original proto file mentioned above. Let me delete the unused one:)

YujiOshima · 2018-05-29T11:03:07Z

pkg/suggestion/cma_service.py

+        channel = grpc.insecure_channel(MANAGER)
+        self.stub = api_pb2_grpc.ManagerStub(channel)
+
+    def GetSuggestions(self, request, context):


What happens when you call GetSuggestions before the previous Trials are completed?

So if the previous trials are not complete, I assume that when I call GetMetrics, no metrics_log_sets will be returned. I will then return a gRPC error saying that all the trials in the previous population must be evaluated first,
like this:
https://github.com/libbyandhelen/hp-tuning-1/blob/b98dccb5cbc70e85004c848cc7cd7ea403a39577/pkg/suggestion/cma_service.py#L95
Is my assumption correct or not?

No, GetMetrics returns the results collected so far even when the worker has not completed.
For example, suppose a worker has not completed and its logs so far are:

epoch 1 accuracy=0.1
epoch 2 accuracy=0.3
epoch 3 accuracy=0.4

GetMetrics will return:

workerId: example
MetricsLog:
  Name: accuracy
  Values:
    - 0.1
    - 0.3
    - 0.4

Then the worker completes:

epoch 1 accuracy=0.1
epoch 2 accuracy=0.3
epoch 3 accuracy=0.4
epoch 4 accuracy=0.6
epoch 5 accuracy=0.8
test_accuracy=0.7
Completed!!

GetMetrics will return:

workerId: example
MetricsLogs:
  Name: accuracy
  Values:
    - 0.1
    - 0.3
    - 0.4
    - 0.6
    - 0.8
  Name: test_accuracy
  Values:
    - 0.7
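
To make the partial-results behaviour concrete, here is a minimal sketch of how `name=value` pairs in a worker log could be grouped per metric. It is only an illustration of the example above: the regular expression and the `collect_metrics` helper are hypothetical and are not Katib's actual metrics collector.

```python
import re
from collections import OrderedDict

# Hypothetical pattern: matches "name=value" pairs such as "accuracy=0.1".
_METRIC_RE = re.compile(r"(\w+)=([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)")


def collect_metrics(log_text):
    """Group every name=value pair found in the log text by metric name.

    Values are kept in the order they appear, so a log that is still being
    written simply yields shorter value lists.
    """
    metrics = OrderedDict()
    for name, value in _METRIC_RE.findall(log_text):
        metrics.setdefault(name, []).append(float(value))
    return metrics


running_log = "epoch 1 accuracy=0.1\nepoch 2 accuracy=0.3\nepoch 3 accuracy=0.4\n"
finished_log = running_log + (
    "epoch 4 accuracy=0.6\nepoch 5 accuracy=0.8\ntest_accuracy=0.7\nCompleted!!\n"
)

partial = collect_metrics(running_log)
final = collect_metrics(finished_log)
```

With the unfinished log only the `accuracy` series is present; once the worker finishes, `test_accuracy` appears as well. This is why an empty reply cannot be used as the signal that a worker is still running.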

@YujiOshima
Then can I call the GetWorkers method and check the worker's status to see whether it has completed?
From your code, I see that after the client calls RunTrial, it creates a worker in the db (Pending state) and spawns a worker (Running state), but where is the code that updates the worker state to completed after the evaluation?

@libbyandhelen The status of the workers is updated when you call GetMetrics.
So you can get the latest status by calling GetWorker after GetMetrics.
If you find that troublesome, we can add a status item to GetMetricsReply.
WDYT?

@YujiOshima
Thanks! The additional status item may be more natural for me. Could you please add it?

libbyandhelen · 2018-06-13T17:22:59Z

@YujiOshima
So now, after I call the GetMetrics method, I loop through every metrics_log_set. If the worker_status in any of them is not completed, a gRPC error is raised, which ensures that all the workers have finished before the final objective value is calculated.
WDYT?
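
The check described above could look roughly like the following sketch. The `MetricsLogSet` tuple, the `"COMPLETED"` label, and the choice of the last reported value as the final objective are all assumptions made for illustration; they are not taken from the PR's code or from the Katib API.

```python
from collections import namedtuple

# Hypothetical stand-in for one entry of the GetMetrics reply.
MetricsLogSet = namedtuple("MetricsLogSet", ["worker_id", "worker_status", "metrics"])

COMPLETED = "COMPLETED"  # assumed status label


def final_objectives(metrics_log_sets, objective_name):
    """Return {worker_id: final objective value}, refusing unfinished workers.

    Assumption: the final objective is the last value reported for the metric.
    """
    pending = [s.worker_id for s in metrics_log_sets if s.worker_status != COMPLETED]
    if pending:
        # In the real service this is where a gRPC error would be returned.
        raise RuntimeError("workers not completed: " + ", ".join(pending))
    return {s.worker_id: s.metrics[objective_name][-1] for s in metrics_log_sets}


done = [
    MetricsLogSet("w1", COMPLETED, {"accuracy": [0.1, 0.8]}),
    MetricsLogSet("w2", COMPLETED, {"accuracy": [0.2, 0.6]}),
]
mixed = done + [MetricsLogSet("w3", "RUNNING", {"accuracy": [0.3]})]
```

Calling `final_objectives(done, "accuracy")` yields one value per worker, while `final_objectives(mixed, "accuracy")` raises because `w3` is still running.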

YujiOshima · 2018-06-14T07:11:38Z

@libbyandhelen That is a somewhat difficult problem.
I think there are three types of suggestion service:

1. Does not depend on past results.
2. Depends on past results and has no explicit end of the suggestion.
3. Depends on past results and has an explicit end.

Random and grid are the 1st type. They only need the request number, and they can return parameter lists regardless of the state of other workers.

BO is the 2nd type. It needs to wait for the workers of the past suggestion to complete. It requires two kinds of return: parameter lists, and an error (the past workers are not completed).

HyperBand is the 3rd type. It also needs the past workers to complete, and the algorithm has an explicit end. It requires three kinds of return: parameter lists, an error (the past workers are not completed), and a signal that the algorithm is completed.

There are two ways to solve the problem.

1. Handle it through the type of error:
- Param lists and no error
- No param lists and error (Waiting)
- No param lists and error (Completed)
2. Introduce a completed flag in GetSuggestionsReply:
- Param lists, no error, and completed:false
- No param lists, error, and completed:false
- No param lists, no error, and completed:true

Handling it through the type of error looks simple, but I don't know whether that is possible across Go, Python, and other languages.
WDYT?
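
As a rough illustration of the second option, the three reply states could be modelled as below. `SuggestionsReply` and `describe` are hypothetical names used only for this sketch; the real GetSuggestionsReply is a protobuf message and its fields may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SuggestionsReply:
    """Hypothetical reply carrying parameter lists, an error, and a completed flag."""
    trials: List[dict] = field(default_factory=list)
    error: Optional[str] = None
    completed: bool = False


def describe(reply):
    """Map a reply onto one of the three states listed above."""
    if reply.completed:
        return "algorithm completed"
    if reply.error is not None:
        return "waiting for past workers"
    return "new parameter lists"


new_params = SuggestionsReply(trials=[{"learning_rate": 0.01}])
waiting = SuggestionsReply(error="past workers are not completed")
finished = SuggestionsReply(completed=True)
```

A caller would branch on `describe(reply)` (or on the fields directly) to decide whether to run new trials, retry later, or stop the study.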

libbyandhelen · 2018-06-15T07:02:26Z

@YujiOshima
I am not sure whether I get your question right, but as for the type of error, I think we can use the error message and error code in gRPC, which should be portable across different languages.
http://avi.im/grpc-errors/
But I think using GetSuggestionsReply is also nice:))

YujiOshima · 2018-06-18T05:26:35Z

@libbyandhelen Thanks! Using error codes looks simpler.
Let's use the FailedPrecondition error code (Code = 9) when any worker_status is not completed.
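
A sketch of what that agreement might look like inside a servicer method follows. To keep it runnable without grpcio, `StatusCode` and `FakeContext` are stand-ins written for this example; in a real Python gRPC servicer one would presumably use `grpc.StatusCode.FAILED_PRECONDITION` with the context's `set_code` and `set_details` methods. The `get_suggestions` function, the status label, and the placeholder suggestion are assumptions as well.

```python
import enum


class StatusCode(enum.IntEnum):
    """Stand-in for the two gRPC status codes used here (numeric values per gRPC)."""
    OK = 0
    FAILED_PRECONDITION = 9


class FakeContext:
    """Hypothetical test double mimicking set_code/set_details of a servicer context."""

    def __init__(self):
        self.code = StatusCode.OK
        self.details = ""

    def set_code(self, code):
        self.code = code

    def set_details(self, details):
        self.details = details


def get_suggestions(worker_statuses, context):
    """Return suggestions only when every previous worker has completed."""
    if any(status != "COMPLETED" for status in worker_statuses):
        context.set_code(StatusCode.FAILED_PRECONDITION)
        context.set_details("all workers of the previous population must be completed")
        return []
    return [{"learning_rate": 0.01}]  # placeholder suggestion, not real CMA-ES output


blocked_ctx = FakeContext()
blocked = get_suggestions(["COMPLETED", "RUNNING"], blocked_ctx)

ready_ctx = FakeContext()
ready = get_suggestions(["COMPLETED", "COMPLETED"], ready_ctx)
```

Because the status code is part of the gRPC protocol rather than of any one language binding, a Go client can distinguish this case from other failures without parsing the error message.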

libbyandhelen · 2018-06-25T16:35:59Z

@YujiOshima
How about now? Is this OK?

YujiOshima · 2018-06-26T01:14:32Z

test/scripts/build-suggestion-bo.sh

@@ -39,3 +39,6 @@ cd ${GO_DIR}

 cp cmd/suggestion/bayesianoptimization/Dockerfile .
 gcloud container builds submit . --tag=${REGISTRY}/${REPO_NAME}/suggestion-bayesianoptimization:${VERSION} --project=${PROJECT}
+
+cp cmd/suggestion/cma/Dockerfile .


Please create a new script, build-suggestion-cma.sh, and register it in /test/workflows/components/workflows.libsonnet.

YujiOshima · 2018-06-26T01:17:04Z

pkg/suggestion/test_cma_client.py

@@ -0,0 +1,81 @@
+import grpc


Please add a unit test for CMA to CI (/test/scripts/run-tests.sh).

YujiOshima · 2018-06-26T01:18:35Z

@libbyandhelen Could you add an e2e test for CMA?
In the e2e test, we use /test/e2e/test-client.go to get suggestions and run trials.

jlewi · 2018-07-24T05:16:39Z

@libbyandhelen This looks like a great contribution. I know it's a lot of work, but if you could update the tests and make the other changes, it would be great to get this added.

jlewi · 2018-10-09T12:54:52Z

@libbyandhelen Any plans to try to push this forward?

ddutta · 2018-11-26T15:22:21Z

/assign @johnugeorge

ddutta · 2018-11-26T15:22:35Z

/assign @xyhuang

c-bata · 2020-03-23T08:43:14Z

Hi! Can I work on this issue?
#1100

johnugeorge · 2020-03-23T15:16:04Z

@c-bata
Thanks for the interest. That would be great.

johnugeorge · 2020-03-23T15:17:15Z

This PR is now obsolete, as the interfaces are no longer valid.

c-bata · 2020-04-16T17:15:53Z

Hi, I guess we can close this because #1131 is merged.

andreyvelich · 2020-04-16T17:18:20Z

@c-bata Sure.
/close

k8s-ci-robot · 2020-04-16T17:18:25Z

@andreyvelich: Closed this PR.

In response to this:

@c-bata Sure.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

libbyandhelen added 9 commits April 12, 2018 09:55

bayesian optimization draft

aeb2848

add matern kernel and two other acquisition functions

58ce050

modify the code structure and add Dockerfile and k8s manifest

b9e7d36

fix typos in manifest files

ca5e891

Merge remote-tracking branch 'upstream/master'

357ee90

add build and deploy test

3d1838a

fix errors in run-tests

4afee6f

fix merge conflict

ea97f63

add cma-es algorithm (wip)

b352d34

k8s-ci-robot added the do-not-merge/work-in-progress label Apr 20, 2018

k8s-ci-robot requested review from ddysher and jose5918 April 20, 2018 22:53

k8s-ci-robot added needs-ok-to-test size/XL labels Apr 20, 2018

k8s-ci-robot assigned gaocegege and YujiOshima Apr 20, 2018

k8s-ci-robot removed the needs-ok-to-test label Apr 22, 2018

libbyandhelen added 7 commits April 23, 2018 14:21

Merge remote-tracking branch 'upstream/master'

68498d6

add dockerfile and manifests, and clean up the code

06ce93f

add dockerfile and manifests, and clean up the code

1660202

refactor

1465850

refactor in dockerfile

5015df9

Merge remote-tracking branch 'upstream/master'

93a3341

add unit test for cma

40acf5a

enable cma test

b98dccb

YujiOshima reviewed May 29, 2018

libbyandhelen added 2 commits May 30, 2018 10:38

Merge remote-tracking branch 'upstream/master'

a070881

delete unused proto file

e08686d

YujiOshima mentioned this pull request Jun 8, 2018

API: Add WorkerStatus to GetMetrics and remove unused items #110

Merged

libbyandhelen added 2 commits June 13, 2018 10:12

fix conflict

4d3faa0

use worker_state in metrics_log_set

48f39d9

add error code

19aa77a

YujiOshima reviewed Jun 26, 2018

k8s-ci-robot assigned johnugeorge Nov 26, 2018

k8s-ci-robot assigned xyhuang Nov 26, 2018

c-bata mentioned this pull request Mar 23, 2020

Add CMA-ES based suggestion service. #1100

Closed

k8s-ci-robot closed this Apr 16, 2020
