Failed to run Grid search example #215

Closed
cheyang opened this issue Oct 17, 2018 · 2 comments · Fixed by #271

Comments


cheyang commented Oct 17, 2018

Hi,

I'm running the grid StudyJob config below with the latest Docker image:

apiVersion: "kubeflow.org/v1alpha1"
kind: StudyJob
metadata:
  namespace: katib
  labels:
    controller-tools.k8s.io: "1.0"
  name: grid-example
spec:
  studyName: grid-example
  owner: crd
  optimizationtype: maximize
  objectivevaluename: Validation-accuracy
  optimizationgoal: 0.99
  requestcount: 5
  metricsnames:
    - accuracy
  parameterconfigs:
    - name: --num-layers
      parametertype: int
      feasible:
        min: "3"
        max: "5"
  workerSpec:
    goTemplate:
        rawTemplate: |-
          apiVersion: batch/v1
          kind: Job
          metadata:
            name: {{.WorkerID}}
            namespace: katib
          spec:
            template:
              spec:
                containers:
                - name: {{.WorkerID}}
                  image: katib/mxnet-mnist-example
                  command:
                  - "python"
                  - "/mxnet/example/image-classification/train_mnist.py"
                  - "--batch-size=64"
                  {{- with .HyperParameters}}
                  {{- range .}}
                  - "{{.Name}}={{.Value}}"
                  {{- end}}
                  {{- end}}
                restartPolicy: Never
  suggestionSpec:
    suggestionAlgorithm: "grid"
    requestNumber: 3

It failed. Here is the log of studyjob-controller:

# kubectl logs -n katib  studyjob-controller-56588dc6f9-57brp
2018/10/17 09:10:42 Create Study grid-example
2018/10/17 09:10:42 Study ID i0e8c554560d4144
2018/10/17 09:10:42 Study ID i0e8c554560d4144 StudyConfname:"grid-example" owner:"crd" optimization_type:MAXIMIZE optimization_goal:0.99 parameter_configs:<configs:<name:"--num-layers" parameter_type:INT feasible:<max:"5" min:"3" > > > objective_value_name:"Validation-accuracy" metrics:"accuracy" metrics:"Validation-accuracy" jobId:"86317dff-d1ec-11e8-85ae-00163e0b3368"
2018/10/17 09:10:42 Study: i0e8c554560d4144 Suggestion Spec &{grid [] 3}
2018/10/17 09:10:42 Study: i0e8c554560d4144 setSuggesitonParameterReply param_id:"l3ff4f1aea1a2f03"
2018/10/17 09:10:42 Study: i0e8c554560d4144 GetSuggestion Error rpc error: code = Unavailable desc = transport is closing
2018/10/17 09:10:42 Fail to check status rpc error: code = Unavailable desc = transport is closing

Here is the log of vizier-suggestion-grid

#kubectl logs -n katib vizier-suggestion-grid-6cbf4b548c-lkfrd
2018/10/17 08:29:32 Study h8fdaa614f26d314 iteration 0 DefaltGrid 1 Grids map[SuggestionCount:0]
2018/10/17 08:29:32 Study h8fdaa614f26d314 : 1 parameters generated

Here is the log of vizier-core

# kubectl logs -n katib vizier-core-b55b5f798-2vzpj
2018/10/17 09:10:45 Start Katib manager: 0.0.0.0:6789

Any suggestions?

@YujiOshima
Contributor

Please try setting suggestionParameters as shown below.

apiVersion: "kubeflow.org/v1alpha1"
kind: StudyJob
metadata:
  namespace: katib
  labels:
    controller-tools.k8s.io: "1.0"
  name: grid-example
spec:
  studyName: grid-example
  owner: crd
  optimizationtype: maximize
  objectivevaluename: Validation-accuracy
  optimizationgoal: 0.99
  requestcount: 5
  metricsnames:
    - accuracy
  parameterconfigs:
    - name: --num-layers
      parametertype: int
      feasible:
        min: "3"
        max: "5"
  workerSpec:
    goTemplate:
        rawTemplate: |-
          apiVersion: batch/v1
          kind: Job
          metadata:
            name: {{.WorkerID}}
            namespace: katib
          spec:
            template:
              spec:
                containers:
                - name: {{.WorkerID}}
                  image: katib/mxnet-mnist-example
                  command:
                  - "python"
                  - "/mxnet/example/image-classification/train_mnist.py"
                  - "--batch-size=64"
                  {{- with .HyperParameters}}
                  {{- range .}}
                  - "{{.Name}}={{.Value}}"
                  {{- end}}
                  {{- end}}
                restartPolicy: Never
  suggestionSpec:
    suggestionAlgorithm: "grid"
    requestNumber: 3
    suggestionParameters:
      - name: "DefaultGrid"
        value: "3"
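
To illustrate what the added DefaultGrid setting could mean for the `--num-layers` range above, here is a minimal Python sketch. It assumes DefaultGrid is the number of evenly spaced grid points per parameter; that reading is an assumption and is not confirmed in this thread. The helper `grid_points` is hypothetical and is not Katib's code.

```python
def grid_points(minimum, maximum, count):
    """Return `count` evenly spaced integers from minimum to maximum inclusive.

    Hypothetical helper for illustration only; Katib's grid suggestion
    service may compute its grid differently.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    if count == 1:
        return [minimum]
    span = maximum - minimum
    # Integer arithmetic keeps every point inside [minimum, maximum].
    return [minimum + (i * span) // (count - 1) for i in range(count)]


# Values copied from the StudyJob spec above (they are strings in the YAML).
feasible = {"min": "3", "max": "5"}
default_grid = "3"  # assumed to mean "three grid points per parameter"

points = grid_points(int(feasible["min"]), int(feasible["max"]), int(default_grid))

# The worker template renders each hyperparameter as "{{.Name}}={{.Value}}".
args = [f"--num-layers={p}" for p in points]
print(args)
```

Under that assumption, the three suggestions requested by requestNumber: 3 would map to one worker per grid point.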


cheyang commented Oct 18, 2018

Thanks, yeah! I'm wondering why it failed and how to debug it. Please let me know when you have time.
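
One possible starting point for debugging is to line up the timestamps from the three logs quoted earlier. The Python sketch below does that with lines copied from this issue; the function names are illustrative. It shows that the only vizier-core line, "Start Katib manager", was logged a few seconds after the controller's GetSuggestion error. That could mean the manager restarted around the time of the request, but this is a hypothesis, not something the thread confirms.

```python
from datetime import datetime

# One representative line per component, copied from the logs in this issue.
LOG_LINES = {
    "studyjob-controller": (
        "2018/10/17 09:10:42 Study: i0e8c554560d4144 GetSuggestion Error "
        "rpc error: code = Unavailable desc = transport is closing"
    ),
    "vizier-suggestion-grid": (
        "2018/10/17 08:29:32 Study h8fdaa614f26d314 : 1 parameters generated"
    ),
    "vizier-core": "2018/10/17 09:10:45 Start Katib manager: 0.0.0.0:6789",
}


def parse_timestamp(line):
    """Parse the leading 'YYYY/MM/DD HH:MM:SS' stamp of a log line."""
    return datetime.strptime(line[:19], "%Y/%m/%d %H:%M:%S")


def seconds_after_error(lines, error_source="studyjob-controller"):
    """Return each component's offset in seconds from the error line."""
    error_time = parse_timestamp(lines[error_source])
    return {
        name: (parse_timestamp(line) - error_time).total_seconds()
        for name, line in lines.items()
        if name != error_source
    }


offsets = seconds_after_error(LOG_LINES)
for name, offset in sorted(offsets.items()):
    print(f"{name}: {offset:+.0f}s relative to the GetSuggestion error")
```

If the offsets point to a restart, `kubectl get pods -n katib` shows restart counts, and `kubectl logs --previous` prints the log of the previous container instance, which may contain the actual failure. Note also that the last vizier-suggestion-grid line is from about 41 minutes earlier and refers to a different study ID, so that pod logged nothing for the failing study.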
