#5999 (comment) identified a use case where a cluster, while being somewhat active, will accrue no progress whatsoever.
Such a use case is extremely expensive for adaptive clusters; e.g. someone might start a 2-hour run on Friday night, go home for the weekend, and find on Monday morning that the whole cluster stayed up the entire time, costing $$$.
Proposed design
Implement a new, fairly long (e.g. 1h by default) timeout in the scheduler, which:
- starts when any task becomes pending or executing
- stops when no tasks are pending or executing
- is reset when any task completes
When that timeout expires, all pending or executing tasks are marked as failed. This in turn must release any in-memory dependent tasks and let the cluster shrink down.
Note that this design will also kill off runs that are blocked due to missing a worker with specific resources.
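The timeout rules above can be sketched as a small standalone state machine. This is only an illustration of the proposed semantics, not scheduler code: the class `NoProgressTimeout`, its method names, and the injectable `clock` are all hypothetical, and a real implementation would hook into the scheduler's task state transitions instead.

```python
import time
from typing import Callable, List, Optional, Set


class NoProgressTimeout:
    """Hypothetical sketch of the proposed no-progress timeout."""

    def __init__(
        self,
        timeout: float = 3600.0,  # 1h default, as suggested in the proposal
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.timeout = timeout
        self.clock = clock
        self.active: Set[str] = set()  # keys of pending or executing tasks
        self.deadline: Optional[float] = None  # None means the timer is stopped

    def task_active(self, key: str) -> None:
        """A task became pending or executing: start the timer if it is stopped."""
        if not self.active:
            self.deadline = self.clock() + self.timeout
        self.active.add(key)

    def task_completed(self, key: str) -> None:
        """A task completed: reset the timer, or stop it if nothing is left."""
        self.active.discard(key)
        if self.active:
            self.deadline = self.clock() + self.timeout
        else:
            self.deadline = None

    def expired(self) -> bool:
        """True if no task has completed for `timeout` seconds while work was outstanding."""
        return self.deadline is not None and self.clock() >= self.deadline

    def fail_stalled(self) -> List[str]:
        """Return the keys that should be marked as failed and stop the timer."""
        stalled = sorted(self.active)
        self.active.clear()
        self.deadline = None
        return stalled
```

In this sketch the caller would poll `expired()` periodically and, once it returns `True`, mark every key returned by `fail_stalled()` as failed so that the cluster can scale down. A task moving from pending to executing does not touch the deadline, since only a completion counts as progress.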