Pytorch operator proposal #33
Conversation
/ok-to-test
@@ -0,0 +1,141 @@
## Motivation
Pytorch is a popular machine learning framework which currently does not have an operator/controller for Kubernetes. This proposal is aimed at defining what that operator should look like, and adding it to Kubeflow.
My biggest question about PyTorch is how to do distributed communication. Based on my limited research (kubeflow/kubeflow#179), it looks like there are multiple backends, e.g. Gloo and MPI.
I've tried it with TCP and Gloo in Kubernetes and it seems to work okay, but I would need to build PyTorch with MPI already installed if I want to try that backend. I also don't have a GPU cluster, so I'm probably not seeing the bigger benefits of the different backends. I was thinking we just provide the backend as an environment variable; people can choose to use that variable in their code if they want, or just hard-code their backend.
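For illustration, a minimal sketch of what opting in might look like in user code, assuming the operator injects a `BACKEND` variable (a hypothetical name, not part of the proposal) alongside the standard variables that `torch.distributed`'s `env://` init method reads:

```python
import os

import torch.distributed as dist

# BACKEND is a hypothetical variable name the operator could inject;
# fall back to gloo if it was not set.
backend = os.environ.get("BACKEND", "gloo")

# env:// pulls MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE from the
# environment, which the operator would set per pod.
dist.init_process_group(backend=backend, init_method="env://")
```

Users who don't care about the variable can ignore it entirely and hard-code the backend in the `init_process_group` call.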
I'm going to have to take back the Gloo comment. It's definitely not working right, aside from connecting.
@jose5918 If you need GPUs, just let me know; you can use dev.kubeflow.org.
```
masterPort: "23456"
replicaSpecs:
  - replicas: 1
    ReplicaType: MASTER
```
FYI, tf-operator will use this struct: `map[ReplicaType]*ReplicaSpec`
Yeah, I think I will change it to use a map.
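For reference, a rough Go sketch of what that map-based layout could look like; everything here beyond `map[ReplicaType]*ReplicaSpec` itself is an assumption for illustration, not the actual pytorch-operator types:

```go
package sketch

// ReplicaType distinguishes the roles a replica can play in a job.
type ReplicaType string

const (
	Master ReplicaType = "MASTER"
	Worker ReplicaType = "WORKER"
)

// ReplicaSpec would carry the replica count plus the pod template.
type ReplicaSpec struct {
	Replicas int32
	// Pod template, resources, etc. would go here.
}

// PyTorchJobSpec keys replica specs by type instead of using a list,
// so lookups like spec.ReplicaSpecs[Master] are direct and duplicate
// replica types are impossible by construction.
type PyTorchJobSpec struct {
	MasterPort   string
	ReplicaSpecs map[ReplicaType]*ReplicaSpec
}
```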
I think this is a great idea. I'd love to see work supporting PyTorch start as soon as possible, whether distributed or not (e.g. just adding it to our notebook images). I'll leave that to whoever wants to work on this. So I don't think we should block development waiting for a fully fleshed-out proposal. I'm fine submitting the proposal as is, or leaving it as an open PR, whichever people think best. However, @jose5918, if/when you're ready to start coding I suggest we just create a repo "experimental-pytorch" and let people who are interested start working.
@@ -0,0 +1,141 @@
## Motivation
Pytorch is a popular machine learning framework which currently does not have an operator/controller for Kubernetes. This proposal is aimed at defining what that operator should look like, and adding it to Kubeflow.
Should we use `PyTorch` instead of `Pytorch`?
Yeah, I'll update.
### Custom Resource Definition
The custom resource submitted to the Kubernetes API would look something like this:
```
masterPort: "23456"
replicaSpecs:
  - replicas: 1
    ReplicaType: MASTER
```
Add `yaml` after the three ticks.
(force-pushed from 6dccf41 to 21cad4e)
/lgtm
My suggestion is we accept the proposal and iterate. Does someone else want to provide a second lgtm?
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: jlewi
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/hold
Hey guys, thanks a lot for the proposal. I have reviewed it and found no obvious holes with respect to the PyTorch perspective.
Currently a work-in-progress proposal for pytorch-operator. I ran distributed PyTorch on Kubernetes with similar specs, so I wanted to start the discussion.
I intend to have a POC by the end of next week.