When setting restartPolicy to OnFailure in PyTorchJob, is there something like maxRestartCount? #1589

Open
zhiyxu opened this issue May 7, 2022 · 3 comments

Comments

@zhiyxu

zhiyxu commented May 7, 2022

When restartPolicy is set to OnFailure in a PyTorchJob, a worker that always fails is restarted continuously.
Is there a setting like maxRestartCount, so that once a worker's restart count reaches the limit, the PyTorchJob fails immediately and releases its resources?

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@tenzen-y
Member

Maybe we can support this feature once #1718 is done
(batch/v1 Job backoffLimitPerIndex).

/lifecycle frozen

@tenzen-y
Member

Currently, we can apply backoffLimit to the entire Job:

// Optional number of retries before marking this job failed.
// +optional
BackoffLimit *int32 `json:"backoffLimit,omitempty"`
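
To illustrate how a job-wide limit like this differs from a per-worker maxRestartCount, here is a minimal sketch. It assumes the limit is compared against the restart count summed over all replicas; the RunPolicy type below is a simplified stand-in that keeps only the field quoted above, and exceedsBackoffLimit is a hypothetical helper, not the operator's actual code.

```go
package main

import "fmt"

// RunPolicy is a simplified stand-in for illustration only.
// Only the BackoffLimit field comes from the thread.
type RunPolicy struct {
	// Optional number of retries before marking this job failed.
	// +optional
	BackoffLimit *int32 `json:"backoffLimit,omitempty"`
}

// exceedsBackoffLimit is a hypothetical helper. It sums the restart
// counts of all replicas and reports whether the total has reached the
// limit. A nil BackoffLimit is treated as "no limit" (an assumption).
func exceedsBackoffLimit(p RunPolicy, restartCounts []int32) bool {
	if p.BackoffLimit == nil {
		return false
	}
	var total int32
	for _, c := range restartCounts {
		total += c
	}
	return total >= *p.BackoffLimit
}

func main() {
	limit := int32(3)
	policy := RunPolicy{BackoffLimit: &limit}

	// Two workers restarted once each: total 2, still under the limit.
	fmt.Println(exceedsBackoffLimit(policy, []int32{1, 1}))

	// One worker restarted twice, another once: total 3, limit reached.
	fmt.Println(exceedsBackoffLimit(policy, []int32{2, 1}))
}
```

Under this assumption the budget is shared by the whole job, so one crash-looping worker can use it up on its own. A per-worker cap would instead need something like the backoffLimitPerIndex mentioned above.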
