Add more information in design
jose5918 committed Mar 19, 2018
1 parent 3e6a020 commit 21cad4e
Showing 2 changed files with 4 additions and 2 deletions.
The second changed file (a binary image) cannot be displayed.
6 changes: 4 additions & 2 deletions proposals/pytorch-operator-proposal.md
@@ -120,9 +120,11 @@ spec:
The worker spec generates a pod. Workers communicate with the master through the master's service name.

## Design
This is an implementation of the PyTorch distributed design patterns, found [here](http://pytorch.org/tutorials/intermediate/dist_tuto.html), through the lens of TFJob, found [here](https://github.com/kubeflow/tf-operator). In the case of Kubernetes, because the operator can easily apply configuration to each process, we will use the environment variable initialization method found [here](http://pytorch.org/tutorials/intermediate/dist_tuto.html#initialization-methods).
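As a rough sketch of what environment variable initialization looks like from the training code's side: PyTorch's `env://` init method reads `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE` from the environment, and the assumption here is that the operator injects these into every pod (with `MASTER_ADDR` set to the master's service name).

```python
# Minimal sketch (not from the proposal): each pod's entrypoint initializes
# the process group via the env:// method. The operator is assumed to have
# injected MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE into the container.
import os
import torch.distributed as dist

def init_distributed(backend="gloo"):
    # With init_method="env://", PyTorch reads MASTER_ADDR, MASTER_PORT,
    # RANK, and WORLD_SIZE from the environment.
    dist.init_process_group(backend=backend, init_method="env://")
    return dist.get_rank(), dist.get_world_size()

if __name__ == "__main__":
    rank, world_size = init_distributed()
    print("rank %d of %d (master: %s)"
          % (rank, world_size, os.environ.get("MASTER_ADDR")))
```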

In most training examples, the pods communicate via the all-reduce operation to average the gradients.
![All-Reduce Pytorch](diagrams/all-reduce-pytorch-operator.jpeg)
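A minimal sketch of that gradient-averaging pattern, adapted from the PyTorch distributed tutorial linked above (the process group is assumed to be initialized already, and `model` is the local replica on this worker):

```python
# Sketch of manual gradient averaging via all-reduce, as in the PyTorch
# distributed tutorial. Assumes dist.init_process_group(...) has already run.
import torch.distributed as dist

def average_gradients(model):
    world_size = float(dist.get_world_size())
    for param in model.parameters():
        if param.grad is None:
            continue
        # Sum this gradient across all workers, then divide by the
        # number of workers to obtain the average.
        dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
        param.grad.data /= world_size
```

A training step would call `average_gradients(model)` after `loss.backward()` and before `optimizer.step()`.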


## Alternatives Considered
One alternative considered for the CRD spec is shown below:
