Katib doesn't support mpijob #1181

YuxiJin-tobeyjin · 2020-05-08T10:20:50Z

/kind bug

What steps did you take and what happened:
Deployed Katib and mpi-operator in my local Kubernetes cluster:

kubectl get po -n kubeflow
NAME                                   READY   STATUS    RESTARTS   AGE
katib-controller-b6dc87fcb-2lrtj       1/1     Running   0          26h
katib-db-manager-79fd46648b-scxx8      1/1     Running   0          2d3h
katib-mysql-7f8bc6956f-fxkgl           1/1     Running   0          13d
katib-ui-74bcbd8b75-bwppw              1/1     Running   0          13d

Using kubectl to create an experiment that uses MPIJob fails with the following error:

Error from server: error when creating "tt-katib.yaml": admission webhook "validating.experiment.katib.kubeflow.org" denied the request: Invalid spec.trialTemplate: Job type kubeflow.org/v1alpha2, Kind=MPIJob not supported.

What did you expect to happen:
Experiment created successfully, Trial and MPIJob can run properly.

Anything else you would like to add:
Currently only Job, TFJob, and PyTorchJob are supported; please consider supporting mpi-operator.
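
To illustrate the kind of check that produces the webhook error above, here is a minimal Python sketch. It is not Katib's actual code (the Katib controller is written in Go); `SUPPORTED_KINDS` and `validate_trial_kind` are hypothetical names, and the supported set is only what this report lists.

```python
# Illustrative sketch only; these names are hypothetical, not Katib's API.
SUPPORTED_KINDS = {"Job", "TFJob", "PyTorchJob"}  # kinds listed in this report


def validate_trial_kind(api_version: str, kind: str) -> None:
    """Raise ValueError when the trial template's kind is not supported."""
    if kind not in SUPPORTED_KINDS:
        raise ValueError(
            f"Invalid spec.trialTemplate: Job type {api_version}, "
            f"Kind={kind} not supported."
        )
```

Under these assumptions, `validate_trial_kind("kubeflow.org/v1alpha2", "MPIJob")` raises an error with the same message as the one reported, and adding `"MPIJob"` to the set makes the check pass.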

Environment:

Kubernetes version: (use kubectl version): 1.14.1
OS (e.g. from /etc/os-release): Ubuntu 16.04.4

issue-label-bot · 2020-05-08T10:20:57Z

Issue-Label Bot is automatically applying the labels:

Label	Probability
area/katib	1.00

Please mark this comment with 👍 or 👎 to give our bot feedback!
Links: app homepage, dashboard and code for this bot.

YuxiJin-tobeyjin · 2020-05-08T10:23:35Z

I'm working on it, and I can submit a PR if needed.

gaocegege · 2020-05-09T08:06:26Z

@YuxiJin-tobeyjin

Thanks. But I'd prefer to have a proposal first, to illustrate that it works.

Thanks for your contribution! 🎉 👍

YuxiJin-tobeyjin · 2020-05-09T09:43:56Z

@gaocegege Thanks for your reply.

OK. Thanks to #341, supporting MPIJob or other Kubeflow jobs is no longer that complicated.
For MPIJob, the required modifications are:

Modify the katib-controller ClusterRole to add mpijobs.
Add the MPIJob definition to the Katib constants, with the related handling during job init.
MPIJob has no master; it consists only of a launcher and workers. The metrics sidecar should therefore be added to the launcher instead, which needs some additional logic.
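
The third point can be sketched as a lookup from job kind to the replica that carries the metrics sidecar. This is a hedged Python illustration, not Katib's implementation: `SIDECAR_REPLICA` and `needs_metrics_sidecar` are hypothetical names, and only the MPIJob entry comes from this thread; the `Master` entries for the other kinds are assumptions.

```python
from typing import Optional

# Hypothetical mapping: which replica should get the metrics sidecar.
SIDECAR_REPLICA = {
    "TFJob": "Master",       # assumption, not stated in this thread
    "PyTorchJob": "Master",  # assumption, not stated in this thread
    "MPIJob": "Launcher",    # from this thread: MPIJob has no master
}


def needs_metrics_sidecar(kind: str, replica_role: Optional[str] = None) -> bool:
    """Return True if a pod of this kind and role should carry the sidecar."""
    if kind == "Job":
        # A plain Job has a single pod template, so there is no role to match.
        return True
    target = SIDECAR_REPLICA[kind]  # KeyError for kinds this sketch does not know
    return replica_role is not None and replica_role.lower() == target.lower()
```

With this mapping, an MPIJob launcher pod gets the sidecar and its worker pods do not.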

I've run some tests; here are some results, just FYI.
My experiment configuration is like this:

apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: kubeflow
  name: mpi-example1
spec:
  parallelTrialCount: 2
  maxTrialCount: 8
  maxFailedTrialCount: 2
  objective:
    type: maximize
    goal: 98
    objectiveMetricName: Accuracy
  algorithm:
    algorithmName: random
  trialTemplate:
    goTemplate:
        rawTemplate: |-
          apiVersion: "kubeflow.org/v1alpha2"
          kind: MPIJob   
          metadata:
            name: {{.Trial}}
            namespace: {{.NameSpace}}
          spec:
            slotsPerWorker: 1
            cleanPodPolicy: None
            mpiReplicaSpecs:
              Launcher:
                replicas: 1
                template:
                  spec:
                    schedulerName: kube-batch
                    containers:
                    - image: ***
                      name: pytorch-mnist
                      command:
                      - mpirun
                      ***
                      - python
                      - pytorch_mnist.py
                      - --epochs=2
                      - --batch-size=64
                      {{- with .HyperParameters}}
                      {{- range .}}
                      - "{{.Name}}={{.Value}}"
                      {{- end}}
                      {{- end}}

              Worker:
              ***
  parameters:
    - name: --lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.03"
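
For readers unfamiliar with the `{{- range .}}` loop in the template above: it appends one `name=value` argument per hyperparameter to the launcher command. A rough Python equivalent, where `render_hyperparameter_args` is a hypothetical helper and `0.02` is just an example value inside the feasible space:

```python
def render_hyperparameter_args(assignments):
    """Mirror the template line `- "{{.Name}}={{.Value}}"` for each assignment."""
    return [f"{a['name']}={a['value']}" for a in assignments]


# Fixed arguments taken from the template; the elided mpirun flags are left out.
base_args = ["python", "pytorch_mnist.py", "--epochs=2", "--batch-size=64"]

# One suggested assignment for the `--lr` parameter (example value only).
trial_args = base_args + render_hyperparameter_args(
    [{"name": "--lr", "value": "0.02"}]
)
```

Here `trial_args` ends with `--lr=0.02`, which is how each trial receives its suggested learning rate.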

After 8 trials, my experiment reached the Succeeded state. Its status detail is:

  status:
    completionTime: "2020-05-07T09:02:42Z"
    conditions:
    - lastTransitionTime: "2020-05-07T08:56:34Z"
      lastUpdateTime: "2020-05-07T08:56:34Z"
      message: Experiment is created
      reason: ExperimentCreated
      status: "True"
      type: Created
    - lastTransitionTime: "2020-05-07T09:02:42Z"
      lastUpdateTime: "2020-05-07T09:02:42Z"
      message: Experiment is running
      reason: ExperimentRunning
      status: "False"
      type: Running
    - lastTransitionTime: "2020-05-07T09:02:42Z"
      lastUpdateTime: "2020-05-07T09:02:42Z"
      message: Experiment has succeeded because max trial count has reached
      reason: ExperimentMaxTrialsReached
      status: "True"
      type: Succeeded
    currentOptimalTrial:
      bestTrialName: mpi-example1-dzhq62b5
      observation:
        metrics:
        - name: Accuracy
          value: 96.95
      parameterAssignments:
      - name: --lr
        value: "0.022062715753755423"
    startTime: "2020-05-07T08:56:34Z"
    succeededTrialList:
    - mpi-example1-5qw8hp9g
    - mpi-example1-7zpz4hmv
    - mpi-example1-9vxv2dks
    - mpi-example1-dzhq62b5
    - mpi-example1-kn6plkg7
    - mpi-example1-rfqwgmxh
    - mpi-example1-tbg2bkdx
    - mpi-example1-vtxtrjnd
    trials: 8
    trialsSucceeded: 8
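
Note that the best Accuracy (96.95) is below the goal (98), so the experiment ended on the trial budget rather than on the goal. A minimal Python sketch of that decision follows. `completion_reason` is a hypothetical function, the `"GoalReached"` label and the order of the checks are assumptions, and only `ExperimentMaxTrialsReached` is taken from the status above.

```python
def completion_reason(objective_type, goal, best_value, trials_done, max_trials):
    """Return why an experiment would stop, or None if it should keep running."""
    if objective_type == "maximize":
        goal_reached = best_value >= goal
    else:
        goal_reached = best_value <= goal
    if goal_reached:
        return "GoalReached"  # hypothetical label, not taken from the thread
    if trials_done >= max_trials:
        return "ExperimentMaxTrialsReached"  # reason shown in the status above
    return None


# Values from this experiment: goal 98, best 96.95, 8 of 8 trials done.
reason = completion_reason("maximize", 98, 96.95, 8, 8)
```

With the values from this run, `reason` is `"ExperimentMaxTrialsReached"`, matching the reported condition.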

gaocegege · 2020-05-11T01:34:24Z

OK, LGTM. Contributions are welcome!

/cc @johnugeorge @andreyvelich

johnugeorge · 2020-05-11T04:58:34Z

LGTM.
Thanks @YuxiJin-tobeyjin for your contribution

k8s-ci-robot added the kind/bug label May 8, 2020

issue-label-bot bot added the area/katib label May 8, 2020

YuxiJin-tobeyjin mentioned this issue May 11, 2020

feature: add support for mpijob in katib #1183

Closed

andreyvelich mentioned this issue Sep 17, 2020

Add MPI operator horovod example #1342

Merged

k8s-ci-robot closed this as completed in #1342 Oct 17, 2020
