diff --git a/proposals/pytorch-operator-proposal.md b/proposals/pytorch-operator-proposal.md
index 655b97efc..d88acdcb3 100644
--- a/proposals/pytorch-operator-proposal.md
+++ b/proposals/pytorch-operator-proposal.md
@@ -1,14 +1,14 @@
 ## Motivation
-Pytorch is a popular machine learning framework which currently does not have an operator/controller for Kubernetes. This proposal is aimed at defining what that operator should look like, and adding it to Kubeflow.
+PyTorch is a popular machine learning framework which currently does not have an operator/controller for Kubernetes. This proposal is aimed at defining what that operator should look like, and adding it to Kubeflow.
 
 ## Goals
-A Kubeflow user should be able to run training using Pytorch as easily as then can using Tensorflow. This proposal is centered around a Kubernetes operator for Pytorch. A user should be able to run both single node and distributed training jobs with Pytorch.
+A Kubeflow user should be able to run training using PyTorch as easily as they can using TensorFlow. This proposal is centered around a Kubernetes operator for PyTorch. A user should be able to run both single node and distributed training jobs with PyTorch.
 
 This proposal defines the following:
-- A Pytorch operator
+- A PyTorch operator
 - A way to deploy the operator with ksonnet
-- A single pod pytorch example
-- A distributed pytorch example
+- A single pod PyTorch example
+- A distributed PyTorch example
 
 ## Non-Goals
 For the scope of this proposal, we won't be addressing the method for serving the model.
@@ -17,9 +17,9 @@ For the scope of this proposal, we won't be addressing the method for serving th
 
 ### Custom Resource Definition
 The custom resource submitted to the Kubernetes API would look something like this:
-```
+```yaml
 apiVersion: "kubeflow.org/v1alpha1"
-kind: "PytorchJob"
+kind: "PyTorchJob"
 metadata:
   name: "example-job"
 spec:
@@ -45,14 +45,14 @@ spec:
           restartPolicy: OnFailure
 ```
 
-This PytorchJob resembles the existing TFJob for the tf-operator. The main differences being the omission of the parameter server replica type, and the addition of `masterPort` and `backend` options.
+This PyTorchJob resembles the existing TFJob for the tf-operator. The main differences are the omission of the parameter server replica type and the addition of the `masterPort` and `backend` options.
 
-`backend` Defines the protocol the pytorch workers will use to communicate when initializing the worker group. Information on the different backends (and the functions they support) can be found [here](http://pytorch.org/docs/master/distributed.html).
+`backend` Defines the protocol the PyTorch workers will use to communicate when initializing the worker group. Information on the different backends (and the functions they support) can be found [here](http://pytorch.org/docs/master/distributed.html).
 
 `masterPort` Defines the port the group will use to communicate with the master's Kubernetes service.
 
 ### Resulting Master
-```
+```yaml
 kind: Service
 apiVersion: v1
 metadata:
@@ -64,7 +64,7 @@ spec:
   - port: 23456
     targetPort: 23456
 ```
-```
+```yaml
 apiVersion: v1
 kind: Pod
 metadata:
@@ -92,10 +92,10 @@ spec:
   restartPolicy: OnFailure
 ```
 
-The master spec will create a service and a pod. The environment variables provided are used when initializing a distributed process group with pytorch. `WORLD_SIZE` is determined by adding the number of replicas in both 'MASTER' and 'WORKER' replicaSpecs. `RANK` is 0 for the master.
+The master spec will create a service and a pod. The environment variables provided are used when initializing a distributed process group with PyTorch. `WORLD_SIZE` is determined by adding the number of replicas in both 'MASTER' and 'WORKER' replicaSpecs. `RANK` is 0 for the master.
 
 ### Resulting Worker
-```
+```yaml
 apiVersion: v1
 kind: Pod
 metadata:
@@ -120,15 +120,15 @@ spec:
 The worker spec generates a pod. They will communicate to the master through the master's service name.
 
 ## Design
-This is an implementaion of the pytorch distributed design patterns, found [here](http://pytorch.org/tutorials/intermediate/dist_tuto.html), via the lense of TFJob found [here](https://github.com/kubeflow/tf-operator).
+This is an implementation of the PyTorch distributed design patterns, found [here](http://pytorch.org/tutorials/intermediate/dist_tuto.html), via the lens of TFJob found [here](https://github.com/kubeflow/tf-operator).
 
 Diagram pending
 
 ## Alternatives Considered
 One alternative considered for the CRD spec is shown below:
-```
+```yaml
 apiVersion: "kubeflow.org/v1alpha1"
-kind: "PytorchJob"
+kind: "PyTorchJob"
 metadata:
   name: "example-job"
 spec:
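For reference, here is a minimal sketch of how a training script running inside these pods could consume the contract described in the diff. Only `WORLD_SIZE` and `RANK` are named in the proposal text; the `MASTER_ADDR`/`MASTER_PORT` variable names and the `gloo` backend are assumptions for illustration, chosen to match PyTorch's standard `env://` initialization method.

```python
import os

import torch
import torch.distributed as dist


def main() -> None:
    # Assumed contract: the operator injects RANK and WORLD_SIZE (as described
    # in the proposal), and MASTER_ADDR/MASTER_PORT point at the master's
    # service name and `masterPort`. init_method="env://" reads all four
    # variables from the environment.
    dist.init_process_group(
        backend="gloo",  # assumed; must match the job's `backend` field
        init_method="env://",
        rank=int(os.environ["RANK"]),
        world_size=int(os.environ["WORLD_SIZE"]),
    )

    # Tiny all-reduce to verify the process group: every rank should print a
    # sum equal to WORLD_SIZE.
    t = torch.ones(1)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: all_reduce sum = {t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```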