Katib doesn't support mpijob #1181

Closed
YuxiJin-tobeyjin opened this issue May 8, 2020 · 6 comments · Fixed by #1342

Comments

@YuxiJin-tobeyjin

YuxiJin-tobeyjin commented May 8, 2020

/kind bug

What steps did you take and what happened:
I deployed Katib and mpi-operator in my local Kubernetes cluster:

kubectl get po -n kubeflow
NAME                                   READY   STATUS    RESTARTS   AGE
katib-controller-b6dc87fcb-2lrtj       1/1     Running   0          26h
katib-db-manager-79fd46648b-scxx8      1/1     Running   0          2d3h
katib-mysql-7f8bc6956f-fxkgl           1/1     Running   0          13d
katib-ui-74bcbd8b75-bwppw              1/1     Running   0          13d

I then used kubectl to create an Experiment that uses MPIJob. Creation failed with the following error:

Error from server: error when creating "tt-katib.yaml": admission webhook "validating.experiment.katib.kubeflow.org" denied the request: Invalid spec.trialTemplate: Job type kubeflow.org/v1alpha2, Kind=MPIJob not supported.

What did you expect to happen:
The Experiment is created successfully, and the Trial and MPIJob run properly.

Anything else you would like to add:
Currently only Job, TFJob, and PyTorchJob are supported. Please consider supporting mpi-operator as well.

Environment:

  • Kubernetes version: (use kubectl version): 1.14.1
  • OS (e.g. from /etc/os-release): Ubuntu 16.04.4
@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
area/katib 1.00


@YuxiJin-tobeyjin
Author

YuxiJin-tobeyjin commented May 8, 2020

I'm working on it, and I can submit a PR if needed.

@gaocegege
Member

@YuxiJin-tobeyjin

Thanks, but I'd prefer to have a proposal first that illustrates how it works.

Thanks for your contribution! 🎉 👍

@YuxiJin-tobeyjin
Author

YuxiJin-tobeyjin commented May 9, 2020

@gaocegege Thanks for your reply.

OK. Thanks to #341, supporting MPIJob or other Kubeflow jobs is no longer that complicated.
For MPIJob, the required modifications are:

  1. Modify the katib-controller ClusterRole to grant access to mpijobs.
  2. Add the MPIJob definition to the Katib constants and handle it during job initialization.
  3. An MPIJob has no master; it consists only of a launcher and workers. The metrics sidecar should therefore be added to the launcher, which requires some additional logic.
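
Steps 2 and 3 above could be sketched roughly as follows. This is only a minimal illustration, not Katib's actual code: `supportedJobKinds`, `jobKind`, `PrimaryReplica`, and `sidecarTarget` are hypothetical names, and the `Master` entries for TFJob and PyTorchJob are assumptions made for the sake of the example.

```go
package main

import "fmt"

// jobKind describes one trial job type. The type and its field are
// hypothetical and exist only for this sketch.
type jobKind struct {
	// PrimaryReplica is the replica type whose pod would receive the
	// metrics sidecar. Empty means the job has a single pod role.
	PrimaryReplica string
}

// supportedJobKinds is an illustrative allow-list keyed by Kind.
// The "Master" values are assumptions; "Launcher" follows the issue text.
var supportedJobKinds = map[string]jobKind{
	"Job":        {PrimaryReplica: ""},
	"TFJob":      {PrimaryReplica: "Master"},
	"PyTorchJob": {PrimaryReplica: "Master"},
	"MPIJob":     {PrimaryReplica: "Launcher"},
}

// sidecarTarget returns the replica type that should get the metrics
// sidecar, or an error if the kind is not in the allow-list.
func sidecarTarget(kind string) (string, error) {
	k, ok := supportedJobKinds[kind]
	if !ok {
		return "", fmt.Errorf("job type %s not supported", kind)
	}
	return k.PrimaryReplica, nil
}

func main() {
	for _, kind := range []string{"TFJob", "MPIJob", "XGBoostJob"} {
		target, err := sidecarTarget(kind)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Printf("%s: inject sidecar into %q\n", kind, target)
	}
}
```

Keeping the primary replica next to the allow-list entry means that adding a new job kind touches one place, which is roughly why the change becomes small once the generic job handling exists.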

I've run some tests; here are some results, just FYI.
My experiment configuration looks like this:

apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: kubeflow
  name: mpi-example1
spec:
  parallelTrialCount: 2
  maxTrialCount: 8
  maxFailedTrialCount: 2
  objective:
    type: maximize
    goal: 98
    objectiveMetricName: Accuracy
  algorithm:
    algorithmName: random 
  trialTemplate:
    goTemplate:
        rawTemplate: |-
          apiVersion: "kubeflow.org/v1alpha2"
          kind: MPIJob   
          metadata:
            name: {{.Trial}}
            namespace: {{.NameSpace}}
          spec:
            slotsPerWorker: 1
            cleanPodPolicy: None
            mpiReplicaSpecs:
              Launcher:
                replicas: 1
                template:
                  spec:
                    schedulerName: kube-batch
                    containers:
                    - image: ***
                      name: pytorch-mnist
                      command:
                      - mpirun
                      ***
                      - python
                      - pytorch_mnist.py
                      - --epochs=2                                                                                                          
                      - --batch-size=64
                      {{- with .HyperParameters}}                                                                                     
                      {{- range .}}
                      - "{{.Name}}={{.Value}}"
                      {{- end}}
                      {{- end}}

              Worker:
              ***
  parameters:                                                                                                                             
    - name: --lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.03"

After 8 trials, my experiment reached the Succeeded state. Its status is:

  status:
    completionTime: "2020-05-07T09:02:42Z"
    conditions:
    - lastTransitionTime: "2020-05-07T08:56:34Z"
      lastUpdateTime: "2020-05-07T08:56:34Z"
      message: Experiment is created
      reason: ExperimentCreated
      status: "True"
      type: Created
    - lastTransitionTime: "2020-05-07T09:02:42Z"
      lastUpdateTime: "2020-05-07T09:02:42Z"
      message: Experiment is running
      reason: ExperimentRunning
      status: "False"
      type: Running
    - lastTransitionTime: "2020-05-07T09:02:42Z"
      lastUpdateTime: "2020-05-07T09:02:42Z"
      message: Experiment has succeeded because max trial count has reached
      reason: ExperimentMaxTrialsReached
      status: "True"
      type: Succeeded
    currentOptimalTrial:
      bestTrialName: mpi-example1-dzhq62b5
      observation:
        metrics:
        - name: Accuracy
          value: 96.95
      parameterAssignments:
      - name: --lr
        value: "0.022062715753755423"
    startTime: "2020-05-07T08:56:34Z"
    succeededTrialList:
    - mpi-example1-5qw8hp9g
    - mpi-example1-7zpz4hmv
    - mpi-example1-9vxv2dks
    - mpi-example1-dzhq62b5
    - mpi-example1-kn6plkg7
    - mpi-example1-rfqwgmxh
    - mpi-example1-tbg2bkdx
    - mpi-example1-vtxtrjnd
    trials: 8
    trialsSucceeded: 8

@gaocegege
Member

OK, LGTM. Contributions are welcome!

/cc @johnugeorge @andreyvelich

@johnugeorge
Member

LGTM.
Thanks @YuxiJin-tobeyjin for your contribution
