Minor fixes
jose5918 committed Mar 19, 2018
1 parent 2d64818 commit 3e6a020
Showing 1 changed file with 16 additions and 16 deletions.
32 changes: 16 additions & 16 deletions proposals/pytorch-operator-proposal.md
@@ -1,14 +1,14 @@
## Motivation
Pytorch is a popular machine learning framework which currently does not have an operator/controller for Kubernetes. This proposal is aimed at defining what that operator should look like, and adding it to Kubeflow.
PyTorch is a popular machine learning framework which currently does not have an operator/controller for Kubernetes. This proposal is aimed at defining what that operator should look like, and adding it to Kubeflow.

## Goals
A Kubeflow user should be able to run training using Pytorch as easily as they can using TensorFlow. This proposal is centered around a Kubernetes operator for Pytorch. A user should be able to run both single node and distributed training jobs with Pytorch.
A Kubeflow user should be able to run training using PyTorch as easily as they can using TensorFlow. This proposal is centered around a Kubernetes operator for PyTorch. A user should be able to run both single node and distributed training jobs with PyTorch.

This proposal defines the following:
- A Pytorch operator
- A PyTorch operator
- A way to deploy the operator with ksonnet
- A single pod pytorch example
- A distributed pytorch example
- A single pod PyTorch example
- A distributed PyTorch example

## Non-Goals
For the scope of this proposal, we won't be addressing the method for serving the model.
@@ -17,9 +17,9 @@ For the scope of this proposal, we won't be addressing the method for serving th

### Custom Resource Definition
The custom resource submitted to the Kubernetes API would look something like this:
```
```yaml
apiVersion: "kubeflow.org/v1alpha1"
kind: "PytorchJob"
kind: "PyTorchJob"
metadata:
name: "example-job"
spec:
@@ -45,14 +45,14 @@ spec:
restartPolicy: OnFailure
```
This PytorchJob resembles the existing TFJob for the tf-operator. The main differences are the omission of the parameter server replica type and the addition of the `masterPort` and `backend` options.
This PyTorchJob resembles the existing TFJob for the tf-operator. The main differences are the omission of the parameter server replica type and the addition of the `masterPort` and `backend` options.

`backend` defines the protocol the pytorch workers will use to communicate when initializing the worker group. Information on the different backends (and the functions they support) can be found [here](http://pytorch.org/docs/master/distributed.html).
`backend` defines the protocol the PyTorch workers will use to communicate when initializing the worker group. Information on the different backends (and the functions they support) can be found [here](http://pytorch.org/docs/master/distributed.html).

`masterPort` defines the port the group will use to communicate with the master's Kubernetes service.
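
As a rough illustration of how these two options might be consumed inside a training container, the sketch below collects them into the keyword arguments accepted by `torch.distributed.init_process_group`. The helper name `process_group_kwargs` and the Service name `example-job-master` are hypothetical and not part of the proposal, and `gloo` is used only as an example backend.

```python
def process_group_kwargs(backend, master_host, master_port, rank, world_size):
    """Collect the arguments a training script might pass to
    torch.distributed.init_process_group (illustrative sketch only)."""
    port = int(master_port)
    if not 0 < port < 65536:
        raise ValueError(f"masterPort out of range: {master_port}")
    return {
        "backend": backend,                            # the job's `backend` option
        "init_method": f"tcp://{master_host}:{port}",  # master Service name + `masterPort`
        "rank": int(rank),
        "world_size": int(world_size),
    }


# "example-job-master" is a hypothetical Service name (an assumption);
# 23456 matches the port used in the Service example of this proposal.
kwargs = process_group_kwargs("gloo", "example-job-master", 23456, rank=0, world_size=3)
print(kwargs["init_method"])

# With PyTorch installed, a training script could then call:
#   import torch.distributed as dist
#   dist.init_process_group(**kwargs)
```

Keeping the PyTorch call out of the helper means the mapping from job options to arguments can be checked without a cluster or a PyTorch install.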

### Resulting Master
```
```yaml
kind: Service
apiVersion: v1
metadata:
@@ -64,7 +64,7 @@ spec:
- port: 23456
targetPort: 23456
```
```
```yaml
apiVersion: v1
kind: Pod
metadata:
@@ -92,10 +92,10 @@ spec:
restartPolicy: OnFailure
```

The master spec will create a service and a pod. The environment variables provided are used when initializing a distributed process group with pytorch. `WORLD_SIZE` is determined by adding the number of replicas in both 'MASTER' and 'WORKER' replicaSpecs. `RANK` is 0 for the master.
The master spec will create a service and a pod. The environment variables provided are used when initializing a distributed process group with PyTorch. `WORLD_SIZE` is determined by adding the number of replicas in both 'MASTER' and 'WORKER' replicaSpecs. `RANK` is 0 for the master.
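
The `WORLD_SIZE` and `RANK` rule described above can be sketched in a few lines of Python. This is not the operator's code: the dictionary keys `replicaType` and `replicas` are assumed field names, and numbering workers from 1 by replica index is an assumption, since the proposal text here only states that the master's rank is 0.

```python
def world_size(replica_specs):
    """Sum the replicas of the MASTER and WORKER specs."""
    return sum(
        spec.get("replicas", 1)
        for spec in replica_specs
        if spec["replicaType"] in ("MASTER", "WORKER")
    )


def pod_env(replica_specs, replica_type, index=0):
    """Return the WORLD_SIZE and RANK values for one pod as strings."""
    # RANK is 0 for the master (stated in the proposal). Numbering workers
    # from 1 by replica index is an assumption made for this sketch.
    rank = 0 if replica_type == "MASTER" else index + 1
    return {"WORLD_SIZE": str(world_size(replica_specs)), "RANK": str(rank)}


# Hypothetical job with 1 master and 3 workers.
specs = [
    {"replicaType": "MASTER", "replicas": 1},
    {"replicaType": "WORKER", "replicas": 3},
]
master_env = pod_env(specs, "MASTER")
last_worker_env = pod_env(specs, "WORKER", index=2)
```

The values are returned as strings because Kubernetes environment variables are always strings; a training script would convert them back with `int(os.environ["WORLD_SIZE"])`.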

### Resulting Worker
```
```yaml
apiVersion: v1
kind: Pod
metadata:
@@ -120,15 +120,15 @@ spec:
The worker spec generates a pod. Workers communicate with the master through the master's service name.

## Design
This is an implementation of the pytorch distributed design patterns, found [here](http://pytorch.org/tutorials/intermediate/dist_tuto.html), via the lens of TFJob found [here](https://github.com/kubeflow/tf-operator).
This is an implementation of the PyTorch distributed design patterns, found [here](http://pytorch.org/tutorials/intermediate/dist_tuto.html), via the lens of TFJob found [here](https://github.com/kubeflow/tf-operator).

Diagram pending

## Alternatives Considered
One alternative considered for the CRD spec is shown below:
```
```yaml
apiVersion: "kubeflow.org/v1alpha1"
kind: "PytorchJob"
kind: "PyTorchJob"
metadata:
name: "example-job"
spec: