
Pytorch operator proposal #33

Merged (3 commits, Mar 20, 2018)

Conversation

@jose5918 (Contributor) commented Mar 8, 2018

Currently a work-in-progress proposal for pytorch-operator. I ran distributed PyTorch on Kubernetes with similar specs, so I wanted to start the discussion.

I intend to have a POC by the end of next week.



@gaocegege (Member):

/ok-to-test

@@ -0,0 +1,141 @@
## Motivation
Pytorch is a popular machine learning framework which currently does not have an operator/controller for Kubernetes. This proposal is aimed at defining what that operator should look like, and adding it to Kubeflow.
Contributor:

My biggest question about PyTorch is how to do distributed communication. Based on my limited research (kubeflow/kubeflow#179), it looks like there are multiple backends, e.g. Gloo and MPI.

Contributor (Author):

I've tried it with TCP and Gloo on Kubernetes and it seems to work okay, but I would need to build PyTorch with MPI installed to try that backend. I also don't have a GPU cluster, so I'm probably not seeing the bigger benefits of the different backends. I was thinking we just provide the backend as an environment variable; people can use that variable in their code if they want, or hard-code their backend.
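The environment-variable idea could look like the sketch below. This is only an illustration of the approach, not part of the proposal; the variable name `DIST_BACKEND` is a hypothetical name for whatever the operator would inject into each replica's pod spec, and the `torch.distributed` call is shown commented out since it needs a live cluster.

```python
import os

def resolve_backend(default="tcp"):
    """Pick the torch.distributed backend from the environment.

    The operator would inject DIST_BACKEND (hypothetical name) into each
    replica's pod; user code can honor it or simply hard-code a backend.
    """
    return os.environ.get("DIST_BACKEND", default)

# In user training code (requires torch and a running cluster):
# import torch.distributed as dist
# dist.init_process_group(
#     backend=resolve_backend(),
#     init_method="env://",  # reads MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE
# )
```

User code that does not care about the variable keeps working unchanged, since the default backend applies when the variable is unset.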

Contributor (Author):

I'm going to have to take back the Gloo comment. It's definitely not working right beyond connecting.

Contributor:

@jose5918 If you need GPUs, just let me know; you can use dev.kubeflow.org.

masterPort: "23456"
replicaSpecs:
- replicas: 1
ReplicaType: MASTER
Member:

FYI, tf-operator will use this struct:

`map[ReplicaType]*ReplicaSpec`

Contributor (Author):

Yeah, I think I will change it to use a map.
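Serialized into the custom resource, a map-keyed spec might look like the sketch below. This is a guess at the resulting YAML shape based on the `map[ReplicaType]*ReplicaSpec` type mentioned above; field names beyond those visible in the diff are assumptions.

```yaml
# Hypothetical map-keyed form of replicaSpecs (replica types become map keys)
replicaSpecs:
  MASTER:
    replicas: 1
  WORKER:
    replicas: 2
```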

@jlewi (Contributor) commented Mar 8, 2018

I think this is a great idea.

I'd love to see work supporting PyTorch start as soon as possible; whether distributed or not (e.g. just adding it to our notebook images). I'll leave that to whoever wants to work on this.

So I don't think we should block development waiting for a fully fleshed-out proposal. I'm fine submitting the proposal as is or leaving it as an open PR, whichever people think best.

However, @jose5918 if/when you're ready to start coding I suggest we just create a repo "experimental-pytorch" and let people who are interested start working.

@@ -0,0 +1,141 @@
## Motivation
Pytorch is a popular machine learning framework which currently does not have an operator/controller for Kubernetes. This proposal is aimed at defining what that operator should look like, and adding it to Kubeflow.
Member:

Should we use PyTorch instead of Pytorch?

Contributor (Author):

Yeah, I'll update.


### Custom Resource Definition
The custom resource submitted to the Kubernetes API would look something like this:
```
Member:

Add `yaml` after the three backticks.

@jose5918 (Contributor, Author) commented Mar 9, 2018

@jlewi I have already started something, but what would be the correct process to create a repo? Creating an issue like this? #35

@jose5918 changed the title from [WIP] Pytorch operator proposal to Pytorch operator proposal on Mar 19, 2018
@jose5918 force-pushed the pytorch-operator-proposal branch from 6dccf41 to 21cad4e on Mar 19, 2018 at 18:02
@jlewi (Contributor) commented Mar 20, 2018

/lgtm

My suggestion is we accept the proposal and iterate. Does someone else want to provide a second lgtm?

@k8s-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jlewi (Contributor) commented Mar 20, 2018
jlewi commented Mar 20, 2018

/hold

@k8s-ci-robot k8s-ci-robot merged commit fb7a359 into kubeflow:master Mar 20, 2018
@soumith commented Mar 23, 2018

Hey guys, thanks a lot for the proposal. I have reviewed it and found no obvious holes from the PyTorch perspective.
