Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add CMA-ES algorithm #67

Closed
wants to merge 32 commits into from
Closed

Conversation

libbyandhelen
Copy link
Contributor

@libbyandhelen libbyandhelen commented Apr 20, 2018

This is an algorithm called Covariance Matrix Adaptation Evolution Strategy, which can compete or even beat Bayesian algorithm in long run. This is more suitable for the case with continuous parameters and having more budget.


This change is Reviewable

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: gaocegege

Assign the PR to them by writing /assign @gaocegege in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@libbyandhelen
Copy link
Contributor Author

/assign @gaocegege @YujiOshima

@gaocegege
Copy link
Member

gaocegege commented Apr 22, 2018

/ok-to-test

Is it WIP or finished?

@YujiOshima
Copy link
Contributor

@libbyandhelen Please dockerize and add kubernetes manifest.
If possible, please add unit-test for this suggestion to test script.

@gaocegege
Copy link
Member

Hi, we refactored the structure of the code base, please rebase master :-)

@libbyandhelen
Copy link
Contributor Author

@YujiOshima OK, will do.
@gaocegege This is still in progress. Will remove the "WIP" in title when finished

@libbyandhelen
Copy link
Contributor Author

@YujiOshima
please take a look:)

_ONE_DAY_IN_SECONDS = 60 * 60 * 24


class ManagerService(api_pb2_grpc.ManagerServicer):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you implement a new ManagerService?
You should use this manager .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I am using the manager implemented by you. This one is just for (unit) testing.
Speaking about testing, it seems that I need to follow the getting-start.md to deploy the katib on k8 clusters. Otherwise, it will stuck on the steps relevant to kubenetes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I understand! Thanks.

@@ -0,0 +1,407 @@
syntax = "proto3";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the current api_pb2.py and api_pb2_grpc.py is generated using the original proto file mentioned above. Let me delete the unused one:)

channel = grpc.insecure_channel(MANAGER)
self.stub = api_pb2_grpc.ManagerStub(channel)

def GetSuggestions(self, request, context):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when you call GetSuggestion before previous Trials are completed.

Copy link
Contributor Author

@libbyandhelen libbyandhelen May 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if the previous trials are not complete, I assume that when I call GetMetrics, no metircs_log_sets will be returned. Then I will return an grpc error saying that all the trials in previous population should be evaluated.
like this:
https://github.com/libbyandhelen/hp-tuning-1/blob/b98dccb5cbc70e85004c848cc7cd7ea403a39577/pkg/suggestion/cma_service.py#L95
Is my assumption correct or not?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, GetMetrics will return currently results when the worker is not completed.
For example, when a worker is not completed and its logs are this.

epoch 1 accuracy=0.1
epoch 2 accuracy=0.3
epoch 3 accuracy=0.4

GetMetrics will return this.

workerId: example
MetricsLog:
    Name: accuracy
    Values:
        - 0.1
        - 0.3
        - 0.4

Then, the worker is completed.

epoch 1 accuracy=0.1
epoch 2 accuracy=0.3
epoch 3 accuracy=0.4
epoch 4 accuracy=0.6
epoch 5 accuracy=0.8
test_accuracy=0.7
Completed!!

GetMetrics will return

workerId: example
MetricsLogs:
    Name: accuracy
    Values:
        - 0.1
        - 0.3
        - 0.4
        - 0.6
        - 0.8
    Name: test_accuracy
    Values:
        - 0.7

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@YujiOshima
Then can I check the status of the worker after calling GetWorkers method to see whether the worker is completed?
From your code, I see that after the client calls RunTrial, it will create a worker in db (Pending state) and spawn a worker (Running state), but where is code that updates the worker state to completed after the evaluation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@libbyandhelen When you call GetMetrics, the status of workers is updated.
So you can get the latest status to call GetWorker after GetMetrics.
If you think it troubles, we can add status item to GetMetricsReply.
WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@YujiOshima
Thanks! The additional status item may be more natural for me. Could you please add it?

@libbyandhelen
Copy link
Contributor Author

@YujiOshima
So now after I call GetMetrics method, I will loop through all the metrics_log_set. And if any worker_status in it is not completed, a grpc error is thrown, so that to ensure all the workers are finished before the final objective value is calculated.
WDYT?

@YujiOshima
Copy link
Contributor

YujiOshima commented Jun 14, 2018

@libbyandhelen That is a little difficult problem.
I think there are three types for the suggestion service.

  • Not depend on the past results
  • Depend on the past results and has no explicit end of the suggestion.
  • Depend on the past results and has an explicit end.

Random and grid are the 1st type. Only need request number. They can return parameter lists regardless of the state of other workers.

BO is the 2nd type. It needs to wait for the completion of the workers of the past suggestion. It requires 2 kinds of return. Parameter lists and error(the past workers are not completed).

HyperBand is the 3rd type. It also needs for the completion of the past workers. And it has an explicit end of the algorithm. It needs three type of return. Parameter lists, error(the past workers are not completed), and the algorithm is completed.

There is two way to solve the problem.

  • Handling a type of Error.

    • Param lists and no error
    • No param lists and error(Wating)
    • No param lists, and error(Completed)
  • Introduce completed flag to the GetSuggestionsReply

    • Param lists, no error, and completed:false
    • No param lists, error and completed:false
    • No param lists, no error, and completed:true

It looks simple to handle the type of error. But I don't know it is possible across golang and python or others.
WDYT?

@libbyandhelen
Copy link
Contributor Author

@YujiOshima
I am not sure whether I get your question right, but as for the type of error, I think we can use the error message and error code in gRPC, which should be portable across different languages.
http://avi.im/grpc-errors/
But I think using GetSuggestionsReply is also nice:))

@YujiOshima
Copy link
Contributor

@libbyandhelen Thanks! It looks more simple to use error codes.
Let's use FailedPrecondition Code = 9 error code when any worker_status in it is not completed.

@libbyandhelen
Copy link
Contributor Author

@YujiOshima
How about now? Is this OK?

@@ -39,3 +39,6 @@ cd ${GO_DIR}

cp cmd/suggestion/bayesianoptimization/Dockerfile .
gcloud container builds submit . --tag=${REGISTRY}/${REPO_NAME}/suggestion-bayesianoptimization:${VERSION} --project=${PROJECT}

cp cmd/suggestion/cma/Dockerfile .
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please create a new script build-suggestion-cma.sh and register it to /test/workflows/components/workflows.libsonnet.

@@ -0,0 +1,81 @@
import grpc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add unit test for CMA to CI (/test/scripts/run-tests.sh) .

@YujiOshima
Copy link
Contributor

@libbyandhelen Could you add e2e test for CMA?
In e2e test, we use /test/e2e/test-client.go to get suggestion and run trial test.

@jlewi
Copy link
Contributor

jlewi commented Jul 24, 2018

@libbyandhelen This looks like a great contribution. I know its a lot of work but if you could update the tests and make the other changes it would be great to get this added.

@jlewi
Copy link
Contributor

jlewi commented Oct 9, 2018

@libbyandhelen Any plans to try to push this forward?

@ddutta
Copy link
Member

ddutta commented Nov 26, 2018

/assign @johnugeorge

@ddutta
Copy link
Member

ddutta commented Nov 26, 2018

/assign @xyhuang

@c-bata
Copy link
Member

c-bata commented Mar 23, 2020

Hi! Can I work on this issue?
#1100

@johnugeorge
Copy link
Member

@c-bata
Thanks for the interest. That would be great.

@johnugeorge
Copy link
Member

This PR is obsolete now as interfaces are no more valid

@c-bata
Copy link
Member

c-bata commented Apr 16, 2020

Hi, I guess we can close this because #1131 is merged.

@andreyvelich
Copy link
Member

@c-bata Sure.
/close

@k8s-ci-robot
Copy link

@andreyvelich: Closed this PR.

In response to this:

@c-bata Sure.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants