Katib doesn't support mpijob #1181
Comments
I'm working on it, and I can submit a PR if needed.
Thanks. But I'd prefer to have a proposal first that illustrates that it works. Thanks for your contribution! 🎉 👍
@gaocegege Thanks for your reply. Thanks to #341, supporting MPIJob or other Kubeflow jobs is no longer that complicated.
I've run some tests; here are the results, just FYI.

apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: kubeflow
  name: mpi-example1
spec:
  parallelTrialCount: 2
  maxTrialCount: 8
  maxFailedTrialCount: 2
  objective:
    type: maximize
    goal: 98
    objectiveMetricName: Accuracy
  algorithm:
    algorithmName: random
  trialTemplate:
    goTemplate:
      rawTemplate: |-
        apiVersion: "kubeflow.org/v1alpha2"
        kind: MPIJob
        metadata:
          name: {{.Trial}}
          namespace: {{.NameSpace}}
        spec:
          slotsPerWorker: 1
          cleanPodPolicy: None
          mpiReplicaSpecs:
            Launcher:
              replicas: 1
              template:
                spec:
                  schedulerName: kube-batch
                  containers:
                  - image: ***
                    name: pytorch-mnist
                    command:
                    - mpirun
                    ***
                    - python
                    - pytorch_mnist.py
                    - --epochs=2
                    - --batch-size=64
                    {{- with .HyperParameters}}
                    {{- range .}}
                    - "{{.Name}}={{.Value}}"
                    {{- end}}
                    {{- end}}
            Worker:
              ***
  parameters:
    - name: --lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.03"

After 8 trials my experiment reached the Succeeded state. Its status detail is:
OK, LGTM. Welcome contributions!
LGTM.
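
For readers unfamiliar with the `{{- with .HyperParameters}}` block in the Experiment above, here is a rough Python sketch of what it does for each trial: it appends one `name=value` argument per hyperparameter to the launcher command. This is only an illustration, not Katib's code. The helper names `sample_assignments` and `render_command` are hypothetical, uniform sampling is an assumption about how the `random` algorithm draws from the feasible space, and the elided `mpirun` flags (`***`) are simply left out.

```python
import random

# Feasible space copied from the Experiment's `parameters` section.
PARAMETERS = [
    {
        "name": "--lr",
        "parameterType": "double",
        "feasibleSpace": {"min": "0.01", "max": "0.03"},
    },
]

# Fixed part of the launcher command from the template.
# The elided mpirun flags (`***` in the original) are intentionally omitted.
BASE_COMMAND = ["mpirun", "python", "pytorch_mnist.py", "--epochs=2", "--batch-size=64"]


def sample_assignments(parameters, rng):
    """Draw one value per parameter (hypothetical helper).

    Assumption: values are drawn uniformly between min and max.
    Only the `double` type used in the Experiment above is handled.
    """
    assignments = []
    for param in parameters:
        if param["parameterType"] != "double":
            raise ValueError("unsupported parameterType: " + param["parameterType"])
        low = float(param["feasibleSpace"]["min"])
        high = float(param["feasibleSpace"]["max"])
        assignments.append({"name": param["name"], "value": str(rng.uniform(low, high))})
    return assignments


def render_command(base_command, assignments):
    """Mimic the `{{.Name}}={{.Value}}` range loop (hypothetical helper).

    Returns a new list: the base command plus one `name=value` item per assignment.
    """
    return base_command + [a["name"] + "=" + a["value"] for a in assignments]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs print the same commands
    for _ in range(2):  # parallelTrialCount is 2 in the Experiment above
        print(render_command(BASE_COMMAND, sample_assignments(PARAMETERS, rng)))
```

Each printed list corresponds to the `command` of one trial's Launcher container, ending in an argument of the form `--lr=<sampled value>`.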
/kind bug
What steps did you take and what happened:
I deployed Katib and mpi-operator in my local Kubernetes cluster,
then used kubectl to create an Experiment that uses MPIJob. Creation failed; the log is as follows:
What did you expect to happen:
Experiment created successfully, Trial and MPIJob can run properly.
Anything else you would like to add:
Currently only Job, TFJob, and PyTorchJob are supported; please consider supporting mpi-operator as well.
Environment:
Kubernetes version (use kubectl version): 1.14.1
OS (e.g. from /etc/os-release): Ubuntu 16.04.4