Trial stuck running - job failed backoff limit reached #845
Comments
Workaround: I deleted the stuck trials and Katib spawned new ones, so manually deleting the stuck trials appears to unjam the experiment.
/assign
Thanks for the issue. I will fix it soon.
/kind bug
I submitted a Katib job. Some of the trials launched training jobs that reached the backoff limit and thus will never succeed.
Yet the trial remains in the running state.
Here's the job spec. I elided some details but left the status to show the job is in the failed state.
Here's the corresponding trial
So the status is stuck in running state but it should be marked as failed since the job will never succeed.
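To make the expected behavior concrete, here is a minimal sketch of the check I have in mind. It is not Katib's actual controller code. It assumes the job is a Kubernetes batch Job, which reports a condition with type `Failed` and status `True` (reason `BackoffLimitExceeded`) once the backoff limit is reached. The function names and the trial state strings are hypothetical, and the example job is illustrative rather than the spec from this issue.

```python
from typing import Any, Mapping


def job_failed_permanently(job: Mapping[str, Any]) -> bool:
    """Return True if the Job has a condition of type Failed with status "True"."""
    status = job.get("status") or {}
    conditions = status.get("conditions") or []
    return any(
        c.get("type") == "Failed" and c.get("status") == "True"
        for c in conditions
    )


def expected_trial_state(job: Mapping[str, Any]) -> str:
    """Map the Job's conditions to a trial state (state names are hypothetical)."""
    if job_failed_permanently(job):
        # The Job controller will not retry, so the trial can never finish.
        return "Failed"
    return "Running"


# Illustrative Job status after the backoff limit is reached.
example_job = {
    "status": {
        "conditions": [
            {"type": "Failed", "status": "True", "reason": "BackoffLimitExceeded"}
        ]
    }
}

print(expected_trial_state(example_job))
```

With a check like this, a trial whose job has a `Failed` condition would move to a terminal state instead of staying in running.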