Blocking restart behavior #684

ahg-g · 2024-10-18T16:09:42Z

What would you like to be added:

A blocking restart behavior where child jobs are created only after all jobs from the previous attempt are fully deleted.

Why is this needed:
Currently, when a JobSet restarts, its child Jobs are recreated in parallel, so one child Job may be recreated before another is fully deleted. This can cause connection problems for training workloads: a new Job may try to connect to a soon-to-be-deleted pod, which triggers another failure.
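
To make the requested behavior concrete, here is a minimal sketch of the check a blocking restart would need before creating any Jobs for a new attempt. The `childJob` type and the `previousAttemptJobsRemain` helper are hypothetical and exist only for illustration; they are not taken from the JobSet codebase.

```go
package main

import "fmt"

// childJob is a hypothetical, simplified stand-in for a child Job.
// It records only the restart attempt the Job was created for.
type childJob struct {
	Name    string
	Attempt int
}

// previousAttemptJobsRemain reports whether any Job created for an attempt
// older than currentAttempt still exists (for example, is still terminating).
// A blocking restart would create no new Jobs while this returns true.
func previousAttemptJobsRemain(existing []childJob, currentAttempt int) bool {
	for _, j := range existing {
		if j.Attempt < currentAttempt {
			return true
		}
	}
	return false
}

func main() {
	// The race described above: one Job from attempt 0 is still being
	// deleted while another has already been recreated for attempt 1.
	existing := []childJob{
		{Name: "workers-0", Attempt: 0},
		{Name: "driver-0", Attempt: 1},
	}
	fmt.Println(previousAttemptJobsRemain(existing, 1))     // true: keep waiting
	fmt.Println(previousAttemptJobsRemain(existing[1:], 1)) // false: safe to create
}
```

With the current parallel behavior this check is effectively skipped, which is how a new Job can end up talking to a pod that is about to disappear.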

We can do this by adding a "JobSetRestartPolicy" field to the FailurePolicy struct. It would take one of two values, Recreate or BlockingRecreate, with Recreate as the default.
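
A rough sketch of what that API shape could look like follows. Only the names `JobSetRestartPolicy`, `FailurePolicy`, `Recreate`, and `BlockingRecreate` come from the proposal above; the `effectiveRestartPolicy` defaulting helper is an assumption added for illustration, and the final API may differ.

```go
package main

import "fmt"

// JobSetRestartPolicy selects how child Jobs are recreated on restart.
type JobSetRestartPolicy string

const (
	// Recreate recreates child Jobs in parallel (the proposed default).
	Recreate JobSetRestartPolicy = "Recreate"
	// BlockingRecreate waits until all Jobs from the previous attempt
	// are fully deleted before creating new ones.
	BlockingRecreate JobSetRestartPolicy = "BlockingRecreate"
)

// FailurePolicy is reduced here to the one proposed field;
// any other fields of the real struct are omitted.
type FailurePolicy struct {
	JobSetRestartPolicy JobSetRestartPolicy
}

// effectiveRestartPolicy is a hypothetical helper that applies the
// proposed default when the policy or the field is unset.
func effectiveRestartPolicy(fp *FailurePolicy) JobSetRestartPolicy {
	if fp == nil || fp.JobSetRestartPolicy == "" {
		return Recreate
	}
	return fp.JobSetRestartPolicy
}

func main() {
	fmt.Println(effectiveRestartPolicy(nil))
	fmt.Println(effectiveRestartPolicy(&FailurePolicy{JobSetRestartPolicy: BlockingRecreate}))
}
```

Defaulting to Recreate would keep existing JobSets on their current behavior, so only workloads that opt in would see the blocking restart.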

This enhancement requires the following artifacts:

Design doc
API change
Docs update

The artifacts should be linked in subsequent comments.


nstogner · 2024-10-18T16:26:43Z

Taking a stab at this

kannon92 · 2024-10-18T18:40:56Z

/assign @nstogner

nstogner · 2024-10-22T19:01:07Z

Breadcrumbs to: #665 (comment)

nstogner mentioned this issue Oct 18, 2024

Add restart strategy #686

Merged

k8s-ci-robot assigned nstogner Oct 18, 2024

danielvegamyhre mentioned this issue Oct 22, 2024

Release v0.7.0 requirements #683

Closed

9 tasks

k8s-ci-robot closed this as completed in #686 Oct 23, 2024
