This repository has been archived by the owner on Sep 19, 2022. It is now read-only.

Pytorch v1alpha2 api implementation #54

Merged 2 commits into kubeflow:master from johnugeorge:v1alpha2 on Aug 28, 2018


@johnugeorge (Member) commented Aug 27, 2018

This PR contains the v1alpha2 version of the PyTorch operator API. It includes the following changes:

  1. PytorchJob v1alpha2 spec is consistent with the TFJob v1alpha2 spec. Fixes #49 ("v1alpha2 pytorch API should try to be consistent with TFJob").
  2. Shares implementation with the TF operator for common libraries such as JobController, logger, and control.
  3. Moved from glide to dep for better dependency management.

Examples will be added in separate PRs.



@coveralls commented Aug 27, 2018

Coverage increased (+16.4%) to 62.553% when pulling 3df7fdc on johnugeorge:v1alpha2 into 12ce79b on kubeflow:master.

@johnugeorge
Member Author

/retest

@johnugeorge changed the title from "WIP: Pytorch v1alpha2 api implementation" to "Pytorch v1alpha2 api implementation" on Aug 28, 2018
@johnugeorge
Member Author

@jose5918

@jose5918
Contributor

@johnugeorge LGTM. I just want to verify, since I'm not familiar with the v1alpha2 implementation. What is the retry logic like in this operator?

@johnugeorge
Member Author

@jose5918 I am not sure if I understand your question. Retry logic if the pod failed?

@johnugeorge
Member Author

For the v1alpha2 implementation, there are two steps involved.

  1. Refactor the tf-operator code to create a base job controller for sharing functionality across operators: https://github.com/kubeflow/tf-operator/tree/master/pkg/controller.v2/jobcontroller. Specs are kept separate for each operator. More code can be shared in the future.
  2. Add operator-specific code in the individual repos.
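
The two steps above can be sketched roughly as follows. This is a minimal illustration using Go struct embedding, not the actual code from either repository: `JobController` here is a simplified stand-in for the shared base controller, and `PyTorchController`, `PyTorchJobSpec`, `GenLabels`, `TotalReplicas`, and the label keys are all hypothetical names chosen for the example.

```go
package main

import "fmt"

// JobController is a simplified, hypothetical stand-in for the shared base
// controller (step 1). It holds logic that is identical for every operator.
type JobController struct {
	ControllerName string
}

// GenLabels builds labels for the pods of a job.
// The label keys are illustrative assumptions, not the real ones.
func (jc JobController) GenLabels(jobName string) map[string]string {
	return map[string]string{
		"controller-name": jc.ControllerName,
		"job-name":        jobName,
	}
}

// PyTorchJobSpec is a hypothetical operator-specific spec.
// Each operator keeps its own spec type.
type PyTorchJobSpec struct {
	MasterReplicas int
	WorkerReplicas int
}

// PyTorchController embeds the shared base and adds operator-specific
// code on top of it (step 2).
type PyTorchController struct {
	JobController
}

// TotalReplicas is operator-specific logic that understands the PyTorch spec.
func (pc PyTorchController) TotalReplicas(spec PyTorchJobSpec) int {
	return spec.MasterReplicas + spec.WorkerReplicas
}

func main() {
	pc := PyTorchController{JobController{ControllerName: "pytorch-operator"}}

	// Shared behavior comes from the embedded base controller.
	fmt.Println(pc.GenLabels("mnist"))

	// Operator-specific behavior lives on the outer type.
	fmt.Println(pc.TotalReplicas(PyTorchJobSpec{MasterReplicas: 1, WorkerReplicas: 3}))
}
```

With this layout, another operator could embed the same base type and supply its own spec and methods, which is one way the "more code can be shared in the future" goal might be approached.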

@jose5918
Contributor

I guess I was wondering if this is true:
https://github.com/johnugeorge/pytorch-operator/blob/v1alpha2/pkg/apis/pytorch/v1alpha2/types.go#L109

I remember some cases with PyTorch distributed where there were retryable errors (such as errors connecting to the master) and the exit codes were below 127.

@johnugeorge
Member Author

The exit-code logic in https://github.com/kubeflow/tf-operator/blob/master/pkg/util/train/train_util.go is used to decide whether a pod should be retried after a failure.
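
As a rough illustration of what an exit-code based retry check can look like, here is a minimal sketch. It is not the function from train_util.go: `isRetryableExitCode` is a hypothetical name, and the split used here (1-127 treated as permanent, 128-255 treated as retryable) is an assumption suggested by the "below 127" remark in this thread. The real logic is in the linked file and may differ.

```go
package main

import "fmt"

// isRetryableExitCode is a hypothetical helper, not the tf-operator function.
// Assumed convention for this sketch only:
//   0        success, nothing to retry
//   1-127    permanent error, do not restart
//   128-255  retryable error, restart the pod
func isRetryableExitCode(exitCode int32) bool {
	return exitCode >= 128 && exitCode <= 255
}

func main() {
	for _, code := range []int32{0, 1, 127, 137, 255} {
		fmt.Printf("exit code %d retryable: %v\n", code, isRetryableExitCode(code))
	}
}
```

Under a rule like this, a transient failure that exits with a code of 127 or lower would not be retried, which is the concern raised in the previous comment.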

@johnugeorge
Member Author

I will add a separate issue to track this.

@jose5918
Contributor

/approve

@jose5918
Contributor

/lgtm

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jose5918


@k8s-ci-robot merged commit d657ace into kubeflow:master on Aug 28, 2018
@jose5918
Contributor

I have verified with @johnugeorge that the restart behavior is similar to before.
