Caffe2 Operator Proposal

Status

  • 2018-04-06 - Accepted

Motivation

Caffe2 is a popular machine learning framework which currently does not have an operator/controller for Kubernetes. This proposal is aimed at defining what that operator should look like, and adding it to Kubeflow.

Unlike TensorFlow, Caffe2 has no parameter server for distributed training, so it has to use Redis or MPI to let the nodes find each other and communicate.

Goals

A Kubeflow user should be able to run training using Caffe2 as easily as they can using TensorFlow. This proposal is centered around a Kubernetes operator for Caffe2. A user should be able to run both single node and distributed training jobs with Caffe2.

This proposal defines the following:

  • A Caffe2 operator
  • A way to deploy the operator with kubectl (see the sketch after this list)
  • A single pod Caffe2 example
  • A distributed Caffe2 example
  • A distributed Caffe2 proposal with a batch scheduler
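
As a rough sketch of what deploying the operator with kubectl could look like, the manifests below register the Caffe2Job kind and run the operator. The group, version, and kind mirror the custom resource shown later; the operator image and Deployment details are placeholders rather than a final implementation.

# Registers the Caffe2Job kind so that Caffe2Job resources can be applied with kubectl.
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: caffe2jobs.kubeflow.org
spec:
  group: kubeflow.org
  version: v1alpha1
  scope: Namespaced
  names:
    kind: Caffe2Job
    singular: caffe2job
    plural: caffe2jobs
---
# Runs the operator itself; the image name is a placeholder.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: caffe2-operator
spec:
  replicas: 1
  selector:
    matchLabels:
      name: caffe2-operator
  template:
    metadata:
      labels:
        name: caffe2-operator
    spec:
      containers:
        - name: caffe2-operator
          image: kubeflow/caffe2-operator:latest   # placeholder image

A user would apply these with kubectl apply and then submit Caffe2Job resources such as the one shown in the next section.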

Non-Goals

For the scope of this proposal, we won't be addressing the method for serving the model.

API (CRD and resulting objects)

Custom Resource Definition

The custom resource submitted to the Kubernetes API would look something like this:

apiVersion: "kubeflow.org/v1alpha1"
kind: "Caffe2Job"
metadata:
  name: "example-job"
spec:
  backend: "redis"
  replicaSpecs:
    - replicas: 1
      ReplicaType: MASTER
      template:
        spec:
          hostNetwork: true
          containers:
            - image: carmark/caffe2:latest
              name: caffe2
              securityContext:
                capabilities:
                  add: ["ALL"]
          restartPolicy: Never
    - replicas: 2
      ReplicaType: WORKER
      template:
        spec:
          hostNetwork: true
          containers:
            - image: carmark/caffe2:latest
              securityContext:
                capabilities:
                  add: ["ALL"]
              name: caffe2
          restartPolicy: Never
    - replicas: 1
      ReplicaType: HELPER
      template:
        spec:
          containers:
            - image: redis:latest
              name: redis
              ports:
                - containerPort: 6379
          restartPolicy: Never

This Caffe2Job resembles the existing TFJob for the tf-operator. The main differences are the omission of the parameter server replica type and the addition of the backend option and the HELPER replica type.

backend defines the distributed communication backend the Caffe2 master and workers will use when initializing the worker group. Information on the different backends (and the functions they support) can be found here.

The HELPER replica type provides service discovery for the redis backend; it is not needed for the gloo backend.
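
As an illustrative sketch (not part of the CRD), the operator could expose the HELPER pod through a Service so that the master and workers reach redis by a stable name. The Service name and selector labels below are assumptions, chosen to match the pod labels shown in the following sections.

apiVersion: v1
kind: Service
metadata:
  name: example-job-redis          # assumed naming scheme
spec:
  selector:
    caffe2_job_name: example-job   # assumed to match the labels the operator sets on the HELPER pod
    job_type: HELPER
  ports:
    - port: 6379
      targetPort: 6379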

Resulting Master

apiVersion: v1
kind: Pod
metadata:
  name: caffe2-master-${job_id}
  labels:
    app: caffe2-job-xx
    caffe2_job_name: example-job
    controller-uid: dc3669c6-29f1-11e8-9ccd-ac1f6b8040c6
    job-name: example-job-master-20lm-1
    job_type: MASTER
    kubeflow.org: ""
    runtime_id: 20lm
    task_index: "0"
spec:
  containers:
  - image: carmark/caffe2:latest
    imagePullPolicy: IfNotPresent
    name: caffe2
  restartPolicy: Never

The labels provided above are used when initializing a distributed process group with Caffe2. task_index is determined by adding up the replicas across the MASTER and WORKER replicaSpecs. job_type is MASTER for the master.
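
One possible way for the operator to surface these labels to the training process, shown here only as a sketch using the Kubernetes downward API, is to inject them as environment variables; the variable names are placeholders.

# Hypothetical fragment of the generated container spec.
env:
  - name: TASK_INDEX
    valueFrom:
      fieldRef:
        fieldPath: metadata.labels['task_index']
  - name: JOB_TYPE
    valueFrom:
      fieldRef:
        fieldPath: metadata.labels['job_type']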

Resulting Worker

apiVersion: v1
kind: Pod
metadata:
  name: caffe2-worker-${job_id}
  labels:
    app: caffe2-job-xx
    caffe2_job_name: example-job
    controller-uid: dc3669c6-29f1-11e8-9ccd-ac1f6b8040c6
    job-name: example-job-worker-20lm-0
    job_type: WORKER
    kubeflow.org: ""
    runtime_id: 20lm
    task_index: "0"
spec:
  containers:
  - image: carmark/caffe2:latest
    imagePullPolicy: IfNotPresent
    name: caffe2
  restartPolicy: Never

Each worker spec generates a pod. Workers communicate with the master through the redis service's name.

Design

This is an implementation of the Caffe2 distributed design patterns, found here, through the lens of the TFJob design found here.

Diagram pending

Other backends

From here, Caffe2 also supports Gloo, another communications library for multi-machine training. With Gloo over MPI, we do not need redis for communication; the master and workers communicate over ssh. It would therefore be better to expose an additional sshd port in the container, start the worker containers first, and then start the master container.
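
A minimal sketch of the extra sshd port in a worker container spec (the port number is the conventional ssh port and is an assumption here):

containers:
  - image: carmark/caffe2:latest
    name: caffe2
    ports:
      - containerPort: 22   # assumed sshd port for the ssh/MPI-based startup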

To enforce this startup order, we may involve a batch scheduler and use priority classes to define the priority.
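
For example, priority classes along the following lines (the names and values are placeholders) could give the worker pods higher scheduling priority than the master:

apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: caffe2-worker-priority   # placeholder name
value: 1000
description: "Schedule Caffe2 workers before the master."
---
apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: caffe2-master-priority   # placeholder name
value: 100
description: "Schedule the Caffe2 master after the workers."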