What would you like to be added:
A blocking restart behavior in which child Jobs are created only after all Jobs from the previous attempt are fully deleted.
Why is this needed:
Currently, when a JobSet restarts, its child Jobs are recreated in parallel, so one child Job may be recreated before another is fully deleted. This can cause connection problems for training workloads: a new Job may try to connect to a soon-to-be-deleted pod, which triggers another failure.
We can do this by adding a `JobSetRestartPolicy` to the `FailurePolicy` struct. The two values would be `Recreate` and `BlockingRecreate`, with `Recreate` as the default.
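To make the proposal concrete, here is a minimal sketch of how the new field and its gating logic might look. It is not the actual JobSet API: the `FailurePolicy` struct below is a simplified stand-in, and the field name `RestartPolicy` and the helpers `effectivePolicy` and `canCreateChildJobs` are hypothetical names chosen for illustration.

```go
package main

import "fmt"

// JobSetRestartPolicy is the proposed enum type (name taken from the proposal).
type JobSetRestartPolicy string

const (
	// Recreate recreates child Jobs in parallel (current behavior, proposed default).
	Recreate JobSetRestartPolicy = "Recreate"
	// BlockingRecreate waits until all Jobs from the previous attempt are gone.
	BlockingRecreate JobSetRestartPolicy = "BlockingRecreate"
)

// FailurePolicy is a simplified stand-in for the real struct.
// The RestartPolicy field name is an assumption.
type FailurePolicy struct {
	RestartPolicy JobSetRestartPolicy
}

// effectivePolicy applies the proposed default (Recreate) when the
// policy is nil or the field is unset. Hypothetical helper.
func effectivePolicy(fp *FailurePolicy) JobSetRestartPolicy {
	if fp == nil || fp.RestartPolicy == "" {
		return Recreate
	}
	return fp.RestartPolicy
}

// canCreateChildJobs reports whether new child Jobs may be created,
// given how many Jobs from the previous attempt still exist.
// Hypothetical helper.
func canCreateChildJobs(fp *FailurePolicy, staleJobs int) bool {
	if effectivePolicy(fp) == BlockingRecreate {
		return staleJobs == 0
	}
	return true
}

func main() {
	fp := &FailurePolicy{RestartPolicy: BlockingRecreate}
	fmt.Println(canCreateChildJobs(fp, 2))  // two old Jobs still being deleted
	fmt.Println(canCreateChildJobs(fp, 0))  // previous attempt fully deleted
	fmt.Println(canCreateChildJobs(nil, 2)) // unset policy falls back to Recreate
}
```

In a real controller, the count of remaining Jobs would presumably come from listing the child Jobs that belong to the previous restart attempt; that lookup is left out here.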
This enhancement requires the following artifacts:
- Design doc
- API change
- Docs update
The artifacts should be linked in subsequent comments.