PytorchJob restartPolicy: ExitCode does not honor backoffLimit for retryable errors #2045
On reviewing this again, there are a few possible solutions I can think of:
My understanding is that the date for (2) is unknown and is pending the job success policy becoming beta in Kubernetes (no date yet). Selfishly, I'd like to use this feature sooner rather than later, so any other short-term fix would be great. What do you think about (1)?
This is actually by design, not a bug. You can see a similar feature in the batch/v1 Job with PodFailurePolicy.
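For reference, the batch/v1 Job feature mentioned above lets you mark certain exit codes as non-retryable while keeping backoffLimit in force for everything else. A minimal sketch (the job name, container name, image, and exit code 42 are placeholders, not taken from this issue):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job            # placeholder name
spec:
  backoffLimit: 3              # retries for other failures stay bounded
  podFailurePolicy:
    rules:
    - action: FailJob          # treat exit code 42 as non-retryable: fail the Job immediately
      onExitCodes:
        containerName: main    # placeholder container name
        operator: In
        values: [42]
  template:
    spec:
      restartPolicy: Never     # podFailurePolicy requires restartPolicy: Never
      containers:
      - name: main
        image: busybox         # placeholder image
        command: ["sh", "-c", "exit 42"]
```

The key design point is that podFailurePolicy classifies failures without changing how the Job controller counts them against backoffLimit.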
Since selecting (1) would be a breaking change, we cannot take that approach.
/kind feature
Hi @tenzen-y, are you going to work on this API change, or how far have you gotten so far?
TBH, I would prefer not to add this feature to the v1 API because we would need to re-implement the same feature in the training-operator.
@tenzen-y
As I mentioned here in #2045 (comment), changing the current behavior without another knob would be a breaking change. So I don't think we can do that.
If I set the FailurePolicy to OnFailure in the PyTorchJob, it restarts until backoffLimit is met. To me, the current behavior seems like a bug rather than a feature as far as the user experience goes. Changing how restarts are counted for ExitCode would be a fix for unexpected behavior, not the introduction of another knob.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
Steps to reproduce:
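The original reproduction manifest is not shown in this capture. A minimal PyTorchJob that would exercise this code path might look like the following sketch (the job name, image, and exit code are assumptions; the training-operator treats exit codes above 128 as retryable under the ExitCode policy):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-exitcode-repro   # placeholder name
spec:
  runPolicy:
    backoffLimit: 3              # expected to cap retries, but is ignored (see below)
  pytorchReplicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: ExitCode    # restart only on retryable exit codes
      template:
        spec:
          containers:
          - name: pytorch
            image: busybox       # placeholder: always exits with a retryable code
            command: ["sh", "-c", "exit 130"]
```

With a manifest like this, each failure is classified as retryable, so the controller keeps recreating the worker pod regardless of backoffLimit.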
Observed Behavior:
When the pod errors out, the controller deletes the pod and recreates it. This happens indefinitely, until the pod completes successfully (if it ever does!), and the backoffLimit is ignored.
Analysis:
The restart behavior differs between "OnFailure" and "ExitCode". For "OnFailure", the Pod's actual restart policy is set to "OnFailure", leaving the K8s pod controller (kubelet) to handle the restarts. For "ExitCode", the pod's restart policy is set to "Never", so restarts are performed by deleting and recreating the pod. As far as I can tell, this is at least partially because the code that checks for exceeding the backoff limit keys off of container restart counts. Since in this case pods are deleted and recreated, there are no container restarts to count.
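The difference described above can be sketched as the pod-level restartPolicy the controller ends up producing in each mode (an illustrative fragment, not the operator's actual output):

```yaml
# PyTorchJob restartPolicy: OnFailure  ->  generated pod spec contains:
spec:
  restartPolicy: OnFailure   # kubelet restarts the container in place,
                             # so the container's restartCount increments
                             # and the backoffLimit check can trip

# PyTorchJob restartPolicy: ExitCode   ->  generated pod spec contains:
spec:
  restartPolicy: Never       # the controller deletes the failed pod and
                             # creates a fresh one; restartCount stays 0,
                             # so the backoffLimit check never trips
```

This is why counting pod recreations (not just container restarts) against backoffLimit would close the gap.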