A failsafe for hung adaptive clusters #6825

crusaderky · 2022-08-04T09:18:48Z

#5999 (comment) identified a use case where a cluster, while still somewhat active, makes no progress whatsoever.

Such a use case is extremely expensive for adaptive clusters; e.g. someone might start a 2-hour run on Friday night, go home for the weekend, and find on Monday morning that the whole cluster remained active the entire time, costing $$$.

Proposed design

Implement a new, fairly long (e.g. 1h by default) timeout in the scheduler, which:

- starts when any task becomes pending or executing
- stops when no tasks are pending or executing
- is reset when any task completes

When that timeout expires, all pending or executing tasks are marked as failed. This in turn must release any in-memory dependent tasks and let the cluster shrink down.

Note that this design will also kill off runs that are blocked because no worker offers the specific resources they require.
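
The start/stop/reset rules of the proposed timeout could be sketched roughly as follows. This is only an illustration under stated assumptions: `NoProgressTimer` and its methods are hypothetical names, not part of distributed, and the sketch deliberately leaves out how the scheduler would mark tasks as failed once the timer expires.

```python
import time
from typing import Callable, Optional


class NoProgressTimer:
    """Hypothetical helper illustrating the proposed timeout; not a distributed API."""

    def __init__(
        self,
        timeout: float = 3600.0,  # 1h default, as suggested in the proposal
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.timeout = timeout
        self.clock = clock
        self.active = 0  # number of tasks currently pending or executing
        self.deadline: Optional[float] = None  # None means the timer is stopped

    def task_started(self) -> None:
        """Call once when a task becomes pending or executing."""
        if self.active == 0:
            # Idle -> busy transition: start the timer.
            self.deadline = self.clock() + self.timeout
        self.active += 1

    def task_completed(self) -> None:
        """Call when a task completes."""
        if self.active == 0:
            raise RuntimeError("task_completed() called with no active tasks")
        self.active -= 1
        if self.active:
            # Progress was made: reset the timer.
            self.deadline = self.clock() + self.timeout
        else:
            # Nothing left pending or executing: stop the timer.
            self.deadline = None

    def expired(self) -> bool:
        """True if tasks are outstanding and none completed within the timeout."""
        return self.deadline is not None and self.clock() >= self.deadline
```

In such a sketch, something on the scheduler side (for example a periodic callback) would poll `expired()` and, when it returns `True`, fail all pending or executing tasks so that the cluster can shrink down. The clock is injectable only so that the behaviour can be exercised without waiting for real time to pass.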

crusaderky mentioned this issue Aug 4, 2022

Restart paused workers after a certain timeout #5999

Open

fjetter added the adaptive All things relating to adaptive scaling label Aug 26, 2022
