Pytorch v1alpha2 api implementation #54

johnugeorge · 2018-08-27T09:37:18Z

This PR adds the v1alpha2 version of the PyTorch operator API. It contains the following changes:

The PytorchJob v1alpha2 spec is now consistent with the TFJob v1alpha2 spec. Fixes #49 ("v1alpha2 pytorch API should try to be consistent with TFJob").
Shares implementation with the TF operator for common libraries such as JobController, logger, and control.
Moved from glide to dep for better dependency management.

Examples will be added in separate PRs.

coveralls · 2018-08-27T09:57:06Z

Coverage increased (+16.4%) to 62.553% when pulling 3df7fdc on johnugeorge:v1alpha2 into 12ce79b on kubeflow:master.

johnugeorge · 2018-08-27T09:57:40Z

/retest

johnugeorge · 2018-08-28T16:39:21Z

@jose5918

jose5918 · 2018-08-28T16:58:14Z

@johnugeorge LGTM. I just want to verify, since I'm not familiar with the v1alpha2 implementation: what is the retry logic like in this operator?

johnugeorge · 2018-08-28T17:44:47Z

@jose5918 I am not sure I understand your question. Do you mean the retry logic when a pod fails?

johnugeorge · 2018-08-28T17:51:34Z

For the v1alpha2 implementation, there are two steps involved.

1. Refactor the tf-operator code to create a base job controller class for sharing functionality across operators: https://github.com/kubeflow/tf-operator/tree/master/pkg/controller.v2/jobcontroller

   Specs are kept separate for every operator. More code can be shared in the future.

2. Add operator-specific code in the individual repos.
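As a rough illustration of the split described above, shared behavior can live in a base type that each operator embeds, with operator-specific logic added on top. This is only a sketch: the `JobController` fields, the `Describe` and `ReplicaTypes` methods, and the replica names are hypothetical and are not taken from the tf-operator or pytorch-operator code.

```go
package main

import "fmt"

// JobController is a hypothetical stand-in for a shared base controller.
// The real shared package is not reproduced here.
type JobController struct {
	Kind string // job kind handled by the embedding operator
}

// Describe is shared behavior available to every operator that embeds JobController.
func (jc *JobController) Describe(jobName string) string {
	return fmt.Sprintf("%s/%s", jc.Kind, jobName)
}

// PyTorchController embeds the shared base and adds operator-specific logic.
type PyTorchController struct {
	JobController
}

// ReplicaTypes is operator-specific; the names are illustrative assumptions.
func (pc *PyTorchController) ReplicaTypes() []string {
	return []string{"Master", "Worker"}
}

func main() {
	pc := &PyTorchController{JobController: JobController{Kind: "PyTorchJob"}}
	fmt.Println(pc.Describe("mnist"), pc.ReplicaTypes())
}
```

Embedding keeps the spec types separate per operator while letting the common methods be promoted onto each operator's controller.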

jose5918 · 2018-08-28T17:57:33Z

I guess I was wondering whether this is true:
https://github.com/johnugeorge/pytorch-operator/blob/v1alpha2/pkg/apis/pytorch/v1alpha2/types.go#L109

I remember some cases with PyTorch distributed where there were retryable errors (like failing to connect to the master) and the exit codes were below 127.

johnugeorge · 2018-08-28T18:03:37Z

The exit codes in https://github.com/kubeflow/tf-operator/blob/master/pkg/util/train/train_util.go are used to check whether a pod has to be retried after a failure.
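A minimal sketch of an exit-code-based retry check follows. It assumes a simple convention in which exit codes 1-127 are treated as permanent failures and 128-255 as retryable (for example, a container killed by a signal). The function name `isRetryableExitCode` and the exact thresholds are assumptions for illustration, not the contents of `train_util.go`.

```go
package main

import "fmt"

// isRetryableExitCode is a hypothetical helper, not the operator's actual function.
// Assumption: codes 128-255 are retryable; 0 (success) and 1-127 are not.
func isRetryableExitCode(exitCode int32) bool {
	return exitCode >= 128 && exitCode <= 255
}

func main() {
	for _, code := range []int32{0, 1, 127, 137, 143} {
		fmt.Printf("exit code %d retryable: %v\n", code, isRetryableExitCode(code))
	}
}
```

Under this convention, a transient failure that exits with a code below 128, such as the connection errors mentioned above, would be classified as permanent, which is the concern raised in this thread.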

johnugeorge · 2018-08-28T18:04:21Z

I will add a separate issue to track this.

jose5918 · 2018-08-28T18:24:28Z

/approve

jose5918 · 2018-08-28T18:24:34Z

/lgtm

k8s-ci-robot · 2018-08-28T18:24:38Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jose5918

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [jose5918]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

jose5918 · 2018-08-28T18:24:57Z

I have verified with @johnugeorge that the restart behavior is similar to before.

Pytorcxh v1alpha2 api implementation

be74886

k8s-ci-robot added do-not-merge/work-in-progress size/XXL labels Aug 27, 2018

k8s-ci-robot requested review from elsonrodriguez and jose5918 August 27, 2018 09:37

Adding unit tests

3df7fdc

johnugeorge changed the title ~~WIP: Pytorch v1alpha2 api implementation~~ Pytorch v1alpha2 api implementation Aug 28, 2018

k8s-ci-robot removed the do-not-merge/work-in-progress label Aug 28, 2018

k8s-ci-robot assigned jose5918 Aug 28, 2018

k8s-ci-robot added the lgtm label Aug 28, 2018

k8s-ci-robot added the approved label Aug 28, 2018

k8s-ci-robot merged commit d657ace into kubeflow:master Aug 28, 2018

akgraner mentioned this pull request Jan 2, 2024

Kubeflow Steering Committee Elections - Testimonial Phase - Johnu George kubeflow/community#675

Closed
