/retest |
@johnugeorge LGTM. I just want to verify, since I'm not familiar with the v1alpha2 implementation: what is the retry logic like in this operator? |
@jose5918 I am not sure if I understand your question. Retry logic if the pod failed? |
For the v1alpha2 implementation, there are two steps involved.
Specs are kept different for every operator. More code can be shared in the future.
|
I guess I was wondering if this is true. I remember some cases where there were retryable errors (like connecting to the master) and the exit codes were below 127 for PyTorch distributed. |
The exit codes in https://github.com/kubeflow/tf-operator/blob/master/pkg/util/train/train_util.go are used to check whether it has to be retried after a failure. |
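For readers following the thread, here is a minimal sketch of what exit-code based retry classification can look like. The function name `isRetryableExitCode` and the exact set of codes are assumptions made for illustration (signal-style codes of 128+N treated as retryable, everything else as permanent); the linked train_util.go is the authoritative source for what the operator actually does.

```go
package main

import "fmt"

// isRetryableExitCode is a hypothetical helper, not the operator's actual API.
// Assumption: exits caused by common termination signals (128 + signal number)
// are treated as retryable, and every other code is treated as permanent.
func isRetryableExitCode(exitCode int32) bool {
	switch exitCode {
	case 130, 137, 143:
		// 128+2 (SIGINT), 128+9 (SIGKILL), 128+15 (SIGTERM):
		// the container was stopped from outside, so a retry may succeed.
		return true
	case 138:
		// 128+10 (SIGUSR1 on Linux x86): assumed here to be a
		// user-defined "please retry" signal.
		return true
	default:
		// Everything else, including low codes such as 1 or 2,
		// is treated as a permanent failure in this sketch.
		return false
	}
}

func main() {
	for _, code := range []int32{1, 127, 137, 143} {
		fmt.Printf("exit code %d retryable: %v\n", code, isRetryableExitCode(code))
	}
}
```

Under a scheme like this, a worker that exits with a low code (for example 1) after failing to reach the master would not be retried, which is the case raised in the question above.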
I will add a separate issue to track this. |
/approve |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jose5918
I have verified with @johnugeorge that the restart behavior is similar to before. |
This PR contains the v1alpha2 version of the PyTorch operator. It contains the following changes
Examples have to be added in separate PRs