This repository has been archived by the owner on May 13, 2024. It is now read-only.

MPIJob in Kubeflow

Edison Gustavo Muenz edited this page Mar 20, 2019 · 1 revision

**tl;dr**

After a short investigation, we found that the MPIJob from the Kubeflow package satisfies our requirements for StatefulJob.

How does it work?

The MPI Job operator (https://github.com/kubeflow/mpi-operator) controls the behaviour of the job.

A StatefulSet is started so that the workers get a set of predictable names. As soon as the StatefulSet is initialised and running, the operator starts a launcher job, which creates a pod with the hostfile and a kubectl-based ssh mock. The launcher job runs mpirun with the necessary number of slots.

For the sake of clarity, we ignore k8s in the following diagram. [Image: Authentication Sequence (1).png]
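
For reference, a minimal MPIJob manifest in the style of the mpi-operator examples of that era might look like this (a sketch hedged against the v1alpha1 API; the name matches the worker names in the hostfile below, the image is illustrative):

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: MPIJob
metadata:
  name: tensorflow-benchmarks-16
spec:
  gpus: 16
  template:
    spec:
      containers:
      - name: tensorflow-benchmarks
        image: mpioperator/tensorflow-benchmarks:latest
```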

How do they solve the DNS naming?

They don't. They just replace ssh with kubectl:

    env:
    - name: OMPI_MCA_plm_rsh_agent
      value: /etc/mpi/kubexec.sh
    - name: OMPI_MCA_orte_default_hostfile
      value: /etc/mpi/hostfile
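
Concretely, Open MPI invokes its rsh agent as `<agent> <hostname> <command...>`, which is the contract kubexec.sh implements. A minimal local sketch of that calling convention (with kubectl replaced by a plain echo, so the argument handling can be seen without a cluster):

```shell
# Sketch of the rsh-agent calling convention that kubexec.sh relies on.
# A plain echo stands in for "/opt/kube/kubectl exec" here.
fake_kubexec() {
  POD_NAME=$1   # first argument: the target "host", i.e. the pod name
  shift         # remaining arguments: the command line to run in the pod
  echo "exec in ${POD_NAME}: $*"
}

fake_kubexec tensorflow-benchmarks-16-worker-0 hostname
# prints: exec in tensorflow-benchmarks-16-worker-0: hostname
```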

kubexec.sh is injected through a ConfigMap:

apiVersion: v1
kind: ConfigMap
data:
  hostfile: |
    tensorflow-benchmarks-16-worker-0 slots=1
    tensorflow-benchmarks-16-worker-1 slots=1
    tensorflow-benchmarks-16-worker-2 slots=1
  kubexec.sh: |
    #!/bin/sh
    set -x
    POD_NAME=$1
    shift
    /opt/kube/kubectl exec ${POD_NAME} -- /bin/sh -c "$*"
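
As a side note, the launcher could derive the `-np` value for mpirun directly from this hostfile; a small sketch (the helper name is ours, not part of the operator):

```shell
# Sum the slots column of an Open MPI hostfile ("host slots=N" per line)
# to get the total number of MPI ranks to launch.
total_slots() {
  awk '{ for (i = 2; i <= NF; i++)
           if ($i ~ /^slots=/) { sub(/slots=/, "", $i); n += $i } }
       END { print n + 0 }' "$1"
}
```

For the hostfile above this yields 3, i.e. `mpirun -np 3`.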

kubectl is provided by an initContainer (https://github.com/kubeflow/mpi-operator/tree/master/cmd/kubectl-delivery).
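
A hedged sketch of what that delivery step could look like in the launcher pod spec (the image name matches the linked directory, and the mount path /opt/kube matches kubexec.sh above; the container and volume names are illustrative):

```yaml
initContainers:
- name: kubectl-delivery            # illustrative name
  image: mpioperator/kubectl-delivery:latest
  volumeMounts:
  - name: mpi-job-kubectl           # emptyDir shared with the launcher container
    mountPath: /opt/kube
```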

How do they start the nodes?

  spec:
    podManagementPolicy: Parallel

The worker pods are started, and therefore initialised, in parallel rather than sequentially (OrderedReady, the default).

What do the workers do?

Since ssh is not involved, they just sleep:

sleep 365d
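
In the worker pod template this amounts to overriding the container command (a sketch; the image name is illustrative):

```yaml
containers:
- name: worker
  image: mpioperator/tensorflow-benchmarks:latest   # illustrative
  command: ["/bin/sh", "-c", "sleep 365d"]
```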

How do they manage ssh?

They don't. They just replace ssh with kubectl exec.

What are the requirements for the container?

As far as I can see, OpenMPI and nothing else.

How do they handle the different hostfile formats of the different MPI implementations?

No idea. Worst case, we can convert the file on the fly.
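
For example, MPICH's Hydra launcher expects a machinefile with "host:N" lines instead of Open MPI's "host slots=N"; converting on the fly would be a one-liner (a sketch with our own helper name):

```shell
# Convert an Open MPI hostfile ("host slots=N") into an
# MPICH/Hydra machinefile ("host:N"), line by line.
to_mpich_machinefile() {
  sed 's/ slots=/:/' "$1"
}
```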

How is the launcher placed?

The launcher pod is typically just mpirun or a thin wrapper around it, so it is placed on one of the nodes together with the workers. It is our responsibility to keep the launcher from causing trouble.
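
One conservative option (illustrative values, not something the operator sets for us) is to give the launcher container small resource requests, so mpirun does not compete with the co-located workers:

```yaml
resources:
  requests:
    cpu: 100m
    memory: 256Mi
```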