Pytorch operator proposal #33
Conversation
/ok-to-test
@@ -0,0 +1,141 @@
## Motivation
Pytorch is a popular machine learning framework which currently does not have an operator/controller for Kubernetes. This proposal is aimed at defining what that operator should look like, and adding it to Kubeflow.
My biggest question about PyTorch is how to do distributed communication. Based on my limited research (kubeflow/kubeflow#179), it looks like there are multiple backends, e.g. Gloo and MPI.
I've tried it with TCP and Gloo in Kubernetes and it seems to work okay, but I would need to build PyTorch with MPI already installed if I want to try that backend. I also don't have a GPU cluster, so I'm probably not seeing the bigger benefits of the different backends. I was thinking we just provide the backend as an environment variable; people can choose to use that variable in their code if they want, or just hard-code their backend.
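For illustration, a minimal sketch of what opting in might look like in user code, assuming the operator injects a `BACKEND` variable (a hypothetical name, not part of the proposal) alongside the standard variables that `torch.distributed`'s `env://` init method reads:

```python
import os

import torch.distributed as dist

# BACKEND is a hypothetical variable name the operator could inject;
# fall back to gloo if it was not set.
backend = os.environ.get("BACKEND", "gloo")

# env:// pulls MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE from the
# environment, which the operator would set per pod.
dist.init_process_group(backend=backend, init_method="env://")
```

Users who don't care about the variable can ignore it entirely and hard-code the backend in the `init_process_group` call.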
I'm going to have to take back the Gloo comment. It's definitely not working right, aside from connecting.
@jose5918 If you need GPUs, just let me know; you can use dev.kubeflow.org.
```
masterPort: "23456"
replicaSpecs:
  - replicas: 1
    ReplicaType: MASTER
```
FYI, tf-operator will use this struct: `map[ReplicaType]*ReplicaSpec`
Yeah, I think I will change it to use a map.
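For reference, a rough Go sketch of what that map-based layout could look like; everything here beyond `map[ReplicaType]*ReplicaSpec` itself is an assumption for illustration, not the actual pytorch-operator types:

```go
package sketch

// ReplicaType distinguishes the roles a replica can play in a job.
type ReplicaType string

const (
	Master ReplicaType = "MASTER"
	Worker ReplicaType = "WORKER"
)

// ReplicaSpec would carry the replica count plus the pod template.
type ReplicaSpec struct {
	Replicas int32
	// Pod template, resources, etc. would go here.
}

// PyTorchJobSpec keys replica specs by type instead of using a list,
// so lookups like spec.ReplicaSpecs[Master] are direct and duplicate
// replica types are impossible by construction.
type PyTorchJobSpec struct {
	MasterPort   string
	ReplicaSpecs map[ReplicaType]*ReplicaSpec
}
```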
I think this is a great idea. I'd love to see work supporting PyTorch start as soon as possible, whether distributed or not (e.g. just adding it to our notebook images). I'll leave that to whoever wants to work on this. So I don't think we should block development waiting for a fully fleshed-out proposal. I'm fine submitting the proposal as is, or leaving it as an open PR, whichever people think best. However, @jose5918, if/when you're ready to start coding I suggest we just create a repo "experimental-pytorch" and let people who are interested start working.
@@ -0,0 +1,141 @@
## Motivation
Pytorch is a popular machine learning framework which currently does not have an operator/controller for Kubernetes. This proposal is aimed at defining what that operator should look like, and adding it to Kubeflow.
Should we use `PyTorch` instead of `Pytorch`?
Yeah, I'll update.
### Custom Resource Definition
The custom resource submitted to the Kubernetes API would look something like this:
```
masterPort: "23456"
replicaSpecs:
  - replicas: 1
    ReplicaType: MASTER
```
Add `yaml` after the three ticks.
(force-pushed from 6dccf41 to 21cad4e)
/lgtm
My suggestion is we accept the proposal and iterate. Does someone else want to provide a second lgtm?
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: jlewi
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/hold
Hey guys, thanks a lot for the proposal. I have reviewed it and found no obvious holes with respect to the PyTorch perspective.
Currently a work-in-progress proposal for pytorch-operator. I ran distributed PyTorch on Kubernetes with similar specs, so I wanted to start the discussion.
I intend to have a POC by the end of next week.