[enh]: validate for bayesian optimization algorithm settings #1600

anencore94 · 2021-07-31T04:38:03Z

What this PR does / why we need it:

This PR adds validation of the algorithm settings for the bayesianoptimization algorithm (skopt).
- The validation criteria follow https://scikit-optimize.github.io/stable/modules/generated/skopt.Optimizer.html
We need this so that an Experiment fails fast when a user specifies a wrong setting by mistake.
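
As a rough illustration of the kind of check described above, here is a minimal sketch in plain Python. It is not the code from this PR: the function name `validate_settings`, the error messages, and the exact lists of allowed values are assumptions based on a reading of the skopt.Optimizer documentation, and the set of setting names covered is assumed as well.

```python
# Allowed values are assumptions taken from the skopt.Optimizer documentation;
# the PR's actual lists may differ.
ALLOWED_CHOICES = {
    "base_estimator": {"GP", "RF", "ET", "GBRT"},
    "acq_func": {"LCB", "EI", "PI", "gp_hedge"},
    "acq_optimizer": {"auto", "sampling", "lbfgs"},
}
# Settings assumed to require a non-negative integer value.
NON_NEGATIVE_INTS = {"n_initial_points", "random_state"}


def validate_settings(algorithm_settings):
    """Return (is_valid, message) for a list of {"name": ..., "value": ...} dicts.

    Values arrive as strings, as they do in an Experiment's algorithmSettings.
    """
    for setting in algorithm_settings:
        name, value = setting["name"], setting["value"]
        if name in ALLOWED_CHOICES:
            if value not in ALLOWED_CHOICES[name]:
                return False, "{} {} is not supported".format(name, value)
        elif name in NON_NEGATIVE_INTS:
            try:
                number = int(value)
            except ValueError:
                return False, "{} {} is not an integer".format(name, value)
            if number < 0:
                return False, "{} must be >= 0, got {}".format(name, value)
        else:
            # Any name outside the known settings is rejected up front.
            return False, "unknown setting {} for bayesianoptimization".format(name)
    return True, ""


print(validate_settings([{"name": "unknown", "value": "10"}]))
print(validate_settings([{"name": "random_state", "value": "-1"}]))
print(validate_settings([{"name": "acq_func", "value": "EI"}]))
```

Returning on the first invalid setting keeps the error message short, which fits the fail-fast goal; collecting all errors would be an equally reasonable design.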

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
fixes part of #1126

Checklist:

Docs included if any changes are user facing

How I Tested

The change is covered by the unit tests.
I also checked it in my Katib cluster, using a new skopt image, with this YAML:

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-example-3
  namespace: kubeflow
spec:
  algorithm:
    algorithmName: bayesianoptimization
    algorithmSettings:
      - name: "unknown"
        value: "10"
  maxFailedTrialCount: 1
  maxTrialCount: 1
  parallelTrialCount: 1
  metricsCollectorSpec:
    collector:
      kind: StdOut
  objective:
    additionalMetricNames:
    - Train-accuracy
    goal: 0.99
    metricStrategies:
    - name: Validation-accuracy
      value: max
    - name: Train-accuracy
      value: max
    objectiveMetricName: Validation-accuracy
    type: maximize
  parameters:
  - feasibleSpace:
      max: "0.03"
      min: "0.01"
    name: lr
    parameterType: double
  - feasibleSpace:
      max: "2"
      min: "1"
    name: num-layers
    parameterType: int
  - feasibleSpace:
      list:
      - sgd
      - adam
      - ftrl
    name: optimizer
    parameterType: categorical
  resumePolicy: LongRunning
  trialTemplate:
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
    primaryContainerName: training-container
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    trialParameters:
    - description: Learning rate for the training model
      name: learningRate
      reference: lr
    - description: Number of training model layers
      name: numberLayers
      reference: num-layers
    - description: Training model optimizer (sgd, adam or ftrl)
      name: optimizer
      reference: optimizer
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
            - command:
              - python3
              - /opt/mxnet-mnist/mnist.py
              - --batch-size=256
              - --lr=${trialParameters.learningRate}
              - --num-layers=${trialParameters.numberLayers}
              - --optimizer=${trialParameters.optimizer}
              image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
              name: training-container
            restartPolicy: Never

The expected error message was printed.
However, I'm afraid that even when the Suggestion and Experiment have failed, the corresponding deployment/pod stays running. I'm not sure why this happens, but I guess it is a bug and should be handled in another PR.

aws-kf-ci-bot · 2021-07-31T04:38:14Z

Hi @anencore94. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

anencore94 · 2021-07-31T04:41:57Z

Here is another test case, using a YAML with a different wrong algorithm setting:

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-example-4
  namespace: kubeflow
spec:
  algorithm:
    algorithmName: bayesianoptimization
    algorithmSettings:
      - name: "random_state"
        value: "-1"
  maxFailedTrialCount: 1
  maxTrialCount: 1
  parallelTrialCount: 1
  metricsCollectorSpec:
    collector:
      kind: StdOut
  objective:
    additionalMetricNames:
    - Train-accuracy
    goal: 0.99
    metricStrategies:
    - name: Validation-accuracy
      value: max
    - name: Train-accuracy
      value: max
    objectiveMetricName: Validation-accuracy
    type: maximize
  parameters:
  - feasibleSpace:
      max: "0.03"
      min: "0.01"
    name: lr
    parameterType: double
  - feasibleSpace:
      max: "2"
      min: "1"
    name: num-layers
    parameterType: int
  - feasibleSpace:
      list:
      - sgd
      - adam
      - ftrl
    name: optimizer
    parameterType: categorical
  resumePolicy: LongRunning
  trialTemplate:
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
    primaryContainerName: training-container
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    trialParameters:
    - description: Learning rate for the training model
      name: learningRate
      reference: lr
    - description: Number of training model layers
      name: numberLayers
      reference: num-layers
    - description: Training model optimizer (sgd, adam or ftrl)
      name: optimizer
      reference: optimizer
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
            - command:
              - python3
              - /opt/mxnet-mnist/mnist.py
              - --batch-size=256
              - --lr=${trialParameters.learningRate}
              - --num-layers=${trialParameters.numberLayers}
              - --optimizer=${trialParameters.optimizer}
              image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
              name: training-container
            restartPolicy: Never

gaocegege · 2021-08-01T13:44:06Z

/assign @johnugeorge @andreyvelich

andreyvelich

Thanks a lot for implementing this @anencore94!
I left a few comments.

pkg/suggestion/v1beta1/skopt/service.py

test/suggestion/v1beta1/test_skopt_service.py

andreyvelich · 2021-08-02T15:19:18Z

However, I'm afraid that even when the Suggestion and Experiment have failed, the corresponding deployment/pod stays running. I'm not sure why this happens, but I guess it is a bug and should be handled in another PR.

This happens because the Experiment ResumePolicy must be Never or FromVolume for the Suggestion resources to be cleaned up after the Experiment completes.
Also, the Suggestion must not be in Failed status for its resources to be cleaned up: https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/experiment/experiment_controller_util.go#L152-L154.
I think that helps users to debug failed Suggestion logs.

@gaocegege @anencore94 @johnugeorge What do you think about this clean-up design?
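
The two conditions in this comment can be sketched as a small predicate. This is only an illustration in Python, not the controller's Go code: the helper name `should_clean_up_suggestion` and its boolean arguments are hypothetical, and only the policy names Never, FromVolume, and LongRunning come from this thread.

```python
# Resume policies that, per the comment, allow Suggestion clean-up.
CLEANUP_RESUME_POLICIES = {"Never", "FromVolume"}


def should_clean_up_suggestion(resume_policy, experiment_completed, suggestion_failed):
    """Hypothetical helper mirroring the two conditions described in the comment."""
    if not experiment_completed:
        return False
    if resume_policy not in CLEANUP_RESUME_POLICIES:
        return False
    # A failed Suggestion is kept so that its logs can still be inspected.
    return not suggestion_failed


# The test Experiments in this PR use resumePolicy: LongRunning.
print(should_clean_up_suggestion("LongRunning", True, True))  # False
print(should_clean_up_suggestion("Never", True, False))       # True
print(should_clean_up_suggestion("Never", True, True))        # False
```

Under this reading, the pod in the test above stays running for two independent reasons: the resume policy is LongRunning, and the Suggestion is in Failed status.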

andreyvelich · 2021-08-02T15:43:50Z

/ok-to-test

- use staticmethod rather than classmethod
- change convertAlgorithmSpec method name to snake_case
- use .format() rather than f-string

Signed-off-by: Jaeyeon Kim <anencore94@gmail.com>
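
To make the three review points concrete, here is a small sketch of what such a refactor might look like. The class name `SettingsConverter`, the method body, and the error message are hypothetical; `convert_algorithm_spec` is simply the snake_case form of the method name mentioned in the commit message and is assumed rather than taken from the diff.

```python
class SettingsConverter:
    """Hypothetical holder class used only for this illustration."""

    @staticmethod
    def convert_algorithm_spec(algorithm_settings):
        """Turn a list of (name, value) string pairs into a dict.

        A staticmethod fits here because the body needs neither the
        class nor an instance.
        """
        converted = {}
        for name, value in algorithm_settings:
            if name in converted:
                # Message built with str.format() instead of an f-string.
                raise ValueError("duplicate algorithm setting: {}".format(name))
            converted[name] = value
        return converted


print(SettingsConverter.convert_algorithm_spec([("random_state", "10")]))
```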

gaocegege · 2021-08-03T06:32:17Z

Also, the Suggestion must not be in Failed status for its resources to be cleaned up: https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/experiment/experiment_controller_util.go#L152-L154.
I think that helps users to debug failed Suggestion logs.

SGTM.

andreyvelich

Thank you for this great contribution @anencore94!
/lgtm
cc @gaocegege @johnugeorge

anencore94 · 2021-08-03T11:56:47Z

I think that helps users to debug failed Suggestion logs.

Yep, that makes sense. It must be helpful when a user wants to debug, so that pod should stay alive. Thanks! @andreyvelich

gaocegege · 2021-08-03T12:15:39Z

/lgtm

Thanks for your contribution! 🎉 👍 @anencore94

andreyvelich

/approve

google-oss-robot · 2021-08-03T20:05:07Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, anencore94

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [andreyvelich]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

[enh]: validate for skopt algorithm settings

ae2672e

aws-kf-ci-bot added the needs-ok-to-test label Jul 31, 2021

google-oss-robot requested review from hougangliu, johnugeorge and sperlingxx July 31, 2021 04:38

google-oss-robot added the size/L label Jul 31, 2021

anencore94 changed the title ~~[enh]: validate for skopt algorithm settings~~ [enh]: validate for bayesian optimization algorithm settings Jul 31, 2021

google-oss-robot assigned andreyvelich and johnugeorge Aug 1, 2021

andreyvelich reviewed Aug 2, 2021

aws-kf-ci-bot added ok-to-test and removed needs-ok-to-test labels Aug 2, 2021

[style]: refactor with reviews

ddaf191

andreyvelich reviewed Aug 3, 2021

google-oss-robot added the lgtm label Aug 3, 2021

google-oss-robot assigned gaocegege Aug 3, 2021

andreyvelich approved these changes Aug 3, 2021

google-oss-robot added the approved label Aug 3, 2021

google-oss-robot merged commit a57745e into kubeflow:master Aug 3, 2021

anencore94 deleted the enhance/skopt_validation branch August 3, 2021 23:33
