Blocking restart behavior #684

Closed
2 of 3 tasks
Tracked by #683
ahg-g opened this issue Oct 18, 2024 · 3 comments · Fixed by #686
ahg-g commented Oct 18, 2024

What would you like to be added:

A blocking restart behavior where child jobs are created only after all jobs from the previous attempt are fully deleted.

Why is this needed:
Currently, when a JobSet restarts, its child Jobs are recreated in parallel, so one child Job may be recreated before another is fully deleted. This can cause connection problems for training workloads: a pod from the new Job may try to connect to a pod that is about to be deleted, which triggers another failure.

We can do this by adding a "JobSetRestartPolicy" to the FailurePolicy struct. The two values would be Recreate and BlockingRecreate, with Recreate as the default.
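
A rough sketch of how this could look, for discussion only. The names `JobSetRestartPolicy`, `FailurePolicy`, `Recreate`, and `BlockingRecreate` come from the proposal above. Everything else is an assumption made for illustration, not the actual JobSet API or controller code: the `RestartPolicy` field name, the `childJob` type, its `Attempt` counter, and the `canCreateJobs` helper are all hypothetical.

```go
package main

import "fmt"

// JobSetRestartPolicy is the type name proposed in this issue.
type JobSetRestartPolicy string

const (
	// Recreate is the proposed default: child Jobs are recreated
	// without waiting for old ones to be fully deleted.
	Recreate JobSetRestartPolicy = "Recreate"
	// BlockingRecreate waits until every Job from the previous
	// attempt is fully deleted before creating new ones.
	BlockingRecreate JobSetRestartPolicy = "BlockingRecreate"
)

// FailurePolicy is trimmed to the one proposed field.
// The field name RestartPolicy is an assumption.
type FailurePolicy struct {
	RestartPolicy JobSetRestartPolicy
}

// childJob is a hypothetical stand-in for a child Job that still
// exists in the cluster, tagged with the attempt that created it.
type childJob struct {
	Name    string
	Attempt int
}

// canCreateJobs reports whether child Jobs for currentAttempt may be
// created, given the Jobs that still exist. Hypothetical helper.
func canCreateJobs(policy JobSetRestartPolicy, existing []childJob, currentAttempt int) bool {
	if policy != BlockingRecreate {
		// Recreate, or unset (treated as the default Recreate).
		return true
	}
	for _, j := range existing {
		if j.Attempt < currentAttempt {
			// A Job from an earlier attempt is not fully deleted yet.
			return false
		}
	}
	return true
}

func main() {
	leftover := []childJob{{Name: "workers-0", Attempt: 0}}

	fmt.Println(canCreateJobs(Recreate, leftover, 1))         // true
	fmt.Println(canCreateJobs(BlockingRecreate, leftover, 1)) // false
	fmt.Println(canCreateJobs(BlockingRecreate, nil, 1))      // true
}
```

With `BlockingRecreate`, a reconcile loop would presumably requeue until the helper returns true, rather than creating new Jobs while old ones are still terminating.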

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

@nstogner
Contributor

Taking a stab at this

@kannon92
Contributor

/assign @nstogner

@nstogner
Contributor

Breadcrumbs to: #665 (comment)
