MPIJob in Kubeflow
After a short investigation we found that MPIJob from the Kubeflow package satisfies our requirements for StatefulJob.
The MPI Job operator (https://github.com/kubeflow/mpi-operator) controls the behaviour of the job.
A StatefulSet is started to get a set of pods with predictable names. As soon as the StatefulSet is initialised and running, the operator starts a launcher job, which creates a pod with a hostfile and a kubectl-based ssh mock. The launcher job runs mpirun with the necessary number of slots.
For the sake of clarity, we ignore k8s in the following diagram. [Image: Authentication Sequence (1).png]
The workers do not run sshd; the operator simply replaces ssh with kubectl.
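For reference, an MPIJob for this setup might look roughly as follows. This is only a sketch assuming the v1alpha2-style API (mpiReplicaSpecs with Launcher and Worker replicas); the exact fields depend on the mpi-operator version installed, and the image is illustrative. The job name is chosen to match the hostfile shown further below.

apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  name: tensorflow-benchmarks-16
spec:
  slotsPerWorker: 1
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: tensorflow-benchmarks
              image: mpioperator/tensorflow-benchmarks:latest   # illustrative image
              command:
                - mpirun
                - "-np"
                - "3"                                            # one rank per worker slot
                - python
                - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
    Worker:
      replicas: 3                                                # matches the hostfile entries below
      template:
        spec:
          containers:
            - name: tensorflow-benchmarks
              image: mpioperator/tensorflow-benchmarks:latest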
The operator points OpenMPI at the ssh mock and at the hostfile through environment variables on the launcher container:

env:
  - name: OMPI_MCA_plm_rsh_agent
    value: /etc/mpi/kubexec.sh
  - name: OMPI_MCA_orte_default_hostfile
    value: /etc/mpi/hostfile
kubexec.sh and the hostfile are injected through a ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: tensorflow-benchmarks-16-config   # name is illustrative; the operator derives it from the job name
data:
  hostfile: |
    tensorflow-benchmarks-16-worker-0 slots=1
    tensorflow-benchmarks-16-worker-1 slots=1
    tensorflow-benchmarks-16-worker-2 slots=1
  kubexec.sh: |
    #!/bin/sh
    set -x
    POD_NAME=$1
    shift
    /opt/kube/kubectl exec ${POD_NAME} -- /bin/sh -c "$*"
kubectl is delivered into the launcher pod by an init container (https://github.com/kubeflow/mpi-operator/tree/master/cmd/kubectl-delivery).
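Putting this together, the launcher pod generated by the operator looks roughly like this. It is only a sketch: the volume names, image tags and file mode are assumptions, not copied from the operator sources.

apiVersion: v1
kind: Pod
metadata:
  name: tensorflow-benchmarks-16-launcher
spec:
  restartPolicy: Never
  initContainers:
    - name: kubectl-delivery
      image: mpioperator/kubectl-delivery:latest   # copies kubectl into the shared volume
      volumeMounts:
        - name: mpi-job-kubectl
          mountPath: /opt/kube
  containers:
    - name: tensorflow-benchmarks
      image: mpioperator/tensorflow-benchmarks:latest
      command: ["mpirun", "-np", "3", "python", "scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py"]
      env:
        - name: OMPI_MCA_plm_rsh_agent
          value: /etc/mpi/kubexec.sh
        - name: OMPI_MCA_orte_default_hostfile
          value: /etc/mpi/hostfile
      volumeMounts:
        - name: mpi-job-kubectl
          mountPath: /opt/kube
        - name: mpi-job-config
          mountPath: /etc/mpi
  volumes:
    - name: mpi-job-kubectl
      emptyDir: {}
    - name: mpi-job-config
      configMap:
        name: tensorflow-benchmarks-16-config
        defaultMode: 0755   # kubexec.sh must be executable

When mpirun starts a remote rank, it effectively runs /etc/mpi/kubexec.sh tensorflow-benchmarks-16-worker-0 orted <args>, which the script turns into kubectl exec tensorflow-benchmarks-16-worker-0 -- /bin/sh -c "orted <args>". The launcher therefore needs a service account that is allowed to exec into the worker pods; the operator takes care of the corresponding RBAC.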
The worker StatefulSet is created with

spec:
  podManagementPolicy: Parallel

so the worker pods are started and initialised in parallel rather than sequentially (OrderedReady is the StatefulSet default).
Since sshd is not needed, the worker containers just sleep:

sleep 365d
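Putting these observations together, the worker StatefulSet generated by the operator looks roughly like this (again a sketch; the labels and image are illustrative):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: tensorflow-benchmarks-16-worker
spec:
  serviceName: tensorflow-benchmarks-16-worker
  podManagementPolicy: Parallel             # start all workers at once
  replicas: 3
  selector:
    matchLabels:
      app: tensorflow-benchmarks-16-worker  # assumed label, for illustration only
  template:
    metadata:
      labels:
        app: tensorflow-benchmarks-16-worker
    spec:
      containers:
        - name: tensorflow-benchmarks
          image: mpioperator/tensorflow-benchmarks:latest
          # no sshd: the container only has to stay alive so that
          # kubectl exec can later start orted inside it
          command: ["sleep", "365d"]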
As far as I can see, only OpenMPI is supported, nothing else (the OMPI_MCA_* variables above are OpenMPI-specific).
No idea. In the worst case, we can convert the file on the fly.
The launcher pod is typically just mpirun, or a thin wrapper around it, so it is scheduled onto one of the nodes together with the workers. It is our responsibility to keep the launcher from causing trouble there.
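One way to do that is to give the launcher replica explicit, small resource requests and limits in the MPIJob manifest, so it cannot starve the worker sharing its node. The field layout follows the v1alpha2-style mpiReplicaSpecs shown earlier, and the values are arbitrary:

mpiReplicaSpecs:
  Launcher:
    replicas: 1
    template:
      spec:
        containers:
          - name: tensorflow-benchmarks
            image: mpioperator/tensorflow-benchmarks:latest
            command: ["mpirun", "-np", "3", "python", "scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py"]
            resources:
              requests:
                cpu: 500m      # keep the launcher small so it does not
                memory: 512Mi  # take resources away from the co-located worker
              limits:
                cpu: "1"
                memory: 1Gi

Alternatively, pod anti-affinity against the worker pods can keep the launcher on a separate node entirely.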