Restart cluster job on task completion #597
Comments
Hi @EvanKomp, thanks for raising this precise issue. The need you describe has been discussed (I think) in #416. There is an issue open in distributed: dask/distributed#3141. I think the best approach would be to implement an option to be able to use […]. It would be very welcome if you could contribute to this, as it is clearly a nice feature for HPC systems.
@guillaumeeb I think that is a reasonable strategy, and pretty flexible. In an extreme case like mine, where task run times can differ by orders of magnitude, your proposed solution could be pushed to the limit: set the "lifetime" (with additional waiting for task completion) to just a few minutes, which all but guarantees that any task not at the very low end of run times will cause the worker to end after the task completes. I can draft up a solution at some point, but I am new to dask, so I could use help identifying the safest mechanism to send the signal that kills a worker gracefully from within the worker. I initially tried the same one that the […]
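To make the trade-off concrete, here is a toy, standard-library-only model of that policy. It is only a sketch under stated assumptions, not dask or dask-jobqueue code: the function name `simulate_lifetime_policy` and its parameters are hypothetical, each job runs one task at a time, and a worker past its `lifetime` finishes the running task before retiring. A task still running at `walltime` is killed and restarted from scratch on a fresh job.

```python
def simulate_lifetime_policy(task_minutes, walltime, lifetime):
    """Return (jobs_started, wasted_minutes) for tasks run one at a time.

    Hypothetical toy model (not a dask API):
    - a worker accepts a new task only while its elapsed time is below
      `lifetime`; after that it retires once the running task completes;
    - a task still running at `walltime` is killed, the minutes it had
      already run are counted as wasted, and it restarts on a fresh job.
    """
    if any(t > walltime for t in task_minutes):
        raise ValueError("a task longer than the walltime can never finish")

    jobs, wasted, elapsed = 1, 0, 0
    for duration in task_minutes:
        if elapsed >= lifetime:
            # Graceful retirement: the next task goes to a fresh job.
            jobs += 1
            elapsed = 0
        if elapsed + duration > walltime:
            # Killed at the walltime: everything since task start is lost.
            wasted += walltime - elapsed
            jobs += 1
            elapsed = 0
        elapsed += duration
    return jobs, wasted


tasks = [5, 50, 20, 3, 40]  # made-up run times in minutes

# Lifetime equal to the walltime: workers never retire early.
print(simulate_lifetime_policy(tasks, walltime=60, lifetime=60))  # (3, 42)

# Very short lifetime: almost every task ends its worker, nothing is killed.
print(simulate_lifetime_policy(tasks, walltime=60, lifetime=2))   # (5, 0)
```

In this model a short lifetime trades more job submissions for zero killed work. It ignores queue wait time, concurrent tasks per worker, and scheduler load balancing, all of which matter on a real cluster.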
I think you should post these questions directly in the corresponding distributed issue; I've never taken a look at the […]
Use case conditions:
Current behavior:
Compute is wasted on tasks that cannot finish in the remaining walltime: when the worker dies, the task is restarted from scratch on a new worker.
Originally working with this issue on the forums here
More context:
Task executes a 3rd party software (mine requiring multiple threads, see the issues linked in the above post for examples). Code looks something like the following:
Result of worker_XXX.log
Worker XXX is killed at 12:29 due to walltime. 7 minutes of compute is wasted because the state cannot be changed. Task 10 starts from scratch on a new worker.
Attempts to fix:
Short of figuring out a way to move the execution state, I figure the best strategy is to have each task get a brand new SLURM job, so that no compute is wasted and any task that can finish in the walltime works.
I think this should be codified somehow as the "solution" above is quite hacky. My intuition says that it would fit best as a scheduler plugin, as using the worker plugin above clearly had adverse effects on task balancing. Happy to help contribute with some input on where this would fit best if it would be a useful addition.