A failsafe for hung adaptive clusters #6825

Open
crusaderky opened this issue Aug 4, 2022 · 0 comments
Labels
adaptive All things relating to adaptive scaling

crusaderky (Collaborator) commented Aug 4, 2022

#5999 (comment) identified a use case where a cluster remains somewhat active yet makes no progress whatsoever.

Such a use case is extremely expensive for adaptive clusters; e.g. someone might start a 2-hour run on Friday night, go home for the weekend, and find on Monday morning that the whole cluster stayed up the entire time, costing $$$.

Proposed design

Implement a new, fairly long timeout in the scheduler (e.g. 1h by default), which

  • starts when any task becomes pending or executing
  • stops when no tasks are pending or executing
  • is reset when any task completes

When that timeout expires, all pending or executing tasks are marked as failed. This in turn must release any in-memory dependent tasks and let the cluster shrink down.
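
The timer rules above can be sketched in isolation as follows. This is only an illustration, not scheduler code: `StallWatchdog` and its methods are hypothetical names that do not exist in distributed, and the sketch assumes that "starts" means the idle-to-busy transition, so a new task arriving while others are already active does not push the deadline back.

```python
from __future__ import annotations

import time
from typing import Callable


class StallWatchdog:
    """Hypothetical no-progress timer; not part of distributed."""

    def __init__(
        self,
        timeout: float = 3600.0,  # 1h default, as proposed above
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.timeout = timeout
        self.clock = clock
        self.active: set[str] = set()  # keys of pending or executing tasks
        self.failed: set[str] = set()  # keys failed by the watchdog
        self.deadline: float | None = None  # None means the timer is stopped

    def task_started(self, key: str) -> None:
        # Start the timer only on the transition from idle to busy (assumption).
        if not self.active:
            self.deadline = self.clock() + self.timeout
        self.active.add(key)

    def task_completed(self, key: str) -> None:
        self.active.discard(key)
        # Progress was made: reset the timer if work remains, otherwise stop it.
        self.deadline = self.clock() + self.timeout if self.active else None

    def check(self) -> set[str]:
        """Fail all active tasks if the deadline has passed; return their keys."""
        if self.deadline is None or self.clock() < self.deadline:
            return set()
        expired = set(self.active)
        self.failed |= expired
        self.active.clear()
        self.deadline = None
        return expired
```

In a real scheduler, `check` would presumably run periodically, and the returned keys would go through the normal task failure path so that their in-memory dependencies are released.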

Note that this design will also kill off runs that are blocked because no available worker provides the specific resources they require.

fjetter added the adaptive label Aug 26, 2022