Close workers gracefully #2892

mrocklin · 2019-07-27T01:03:06Z

This allows workers to optionally terminate themselves gracefully after a predetermined time. This can be helpful in a few contexts:

We receive a SIGINT and know that we need to clean up quickly (though note that the signal handlers are not implemented as part of this PR; cc @jni).
We know that we'll be kicked off at a certain time, such as one hour from now, as is often specified by HPC job schedulers (cc @lesteve @guillaumeeb @jhamman).
We just want to refresh our workers every once in a while because we know that our code leaks some memory. Fixes #2861 (Gracefully restart a worker from a nanny).

This is configurable through keywords to the Worker or Nanny classes, through config values, or from the CLI. Here is an example using the CLI.

Restart to clear state

dask-worker scheduler:8786 --lifetime 1hr --lifetime-restart --lifetime-stagger 5m

This will kill the worker roughly 1 hour from now, plus or minus up to 5 minutes (to avoid killing all of our workers at the same time). It will also allow that worker to be restarted afterwards.
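The staggering described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from this PR: `staggered_lifetime` is a hypothetical helper, and it assumes both durations have already been converted to seconds.

```python
import random


def staggered_lifetime(lifetime, stagger, rng=random):
    """Shift `lifetime` by a uniform random offset within +/- `stagger`.

    Hypothetical helper for illustration only; both arguments are in seconds.
    """
    offset = (rng.random() * 2 - 1) * stagger
    return lifetime + offset


# Equivalent of `--lifetime 1hr --lifetime-stagger 5m`, expressed in seconds.
rng = random.Random(0)  # seeded only so the example is reproducible
deadlines = [staggered_lifetime(3600, 300, rng) for _ in range(5)]
print([round(d) for d in deadlines])
```

Each worker draws its own offset, so a group of workers started together ends up with deadlines spread across a ten-minute window rather than all closing at once.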

Restart to avoid walltime death

dask-worker scheduler:8786 --lifetime 58m

Here we don't try to restart the worker (no point) and we choose a time a bit before our 60m walltime.
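The arithmetic behind picking a value like 58m can be written down explicitly. The sketch below is illustrative only: `safe_lifetime` is a hypothetical helper, and the two-minute margin is an assumption chosen to match the example above. If a stagger is also used, it should be subtracted too, so that even the latest possible deadline falls before the walltime.

```python
def safe_lifetime(walltime, stagger=0, margin=120):
    """Largest lifetime such that lifetime + stagger + margin <= walltime.

    Hypothetical helper for illustration; all values are in seconds.
    """
    lifetime = walltime - stagger - margin
    if lifetime <= 0:
        raise ValueError("walltime is too short for the requested stagger and margin")
    return lifetime


# A 60-minute walltime with a 2-minute safety margin and no stagger.
lifetime = safe_lifetime(walltime=60 * 60, margin=2 * 60)
print(f"dask-worker scheduler:8786 --lifetime {lifetime // 60}m")
# prints: dask-worker scheduler:8786 --lifetime 58m
```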

Also cc @jacobtomlinson @jcrist for general deployment information

For folks who want to review, I recommend going commit-by-commit (it should be fairly clean). This is also a decently simple/educational PR to review.

mrocklin · 2019-07-29T20:54:35Z

Merging this tomorrow if there are no further comments (although comments would be quite welcome here).

TomAugspurger · 2019-07-29T22:23:27Z

> though note that the signal handlers are not implemented as part of this PR

Do you anticipate this making fixing #2788 easier? Your comment in #2788 (comment) makes it sound like "yes".

TomAugspurger

Implementation looks good at a glance.

mrocklin · 2019-07-29T23:16:58Z

> Do you anticipate this making fixing #2788 easier?

It's the action that I think we would want to take when handling a SIGINT.

mrocklin · 2019-07-29T23:17:40Z

Merging this shortly if there are no further comments

guillaumeeb

Thanks for this @mrocklin, just a few comments on the documentation.

lesteve · 2019-08-29T08:33:35Z

Just a quick comment (I don't think this is a problem for this PR, since the signal handling is only part of it AFAICT): for SGE, the signal you get when the job exceeds the walltime is a SIGKILL (i.e. kill -9). Not sure about other job schedulers.

jacobtomlinson · 2019-08-29T08:48:59Z

Ouch! AFAIK Slurm and PBS send a SIGTERM shortly before the SIGKILL to give the process a chance to exit cleanly.
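To illustrate how a SIGTERM warning could be turned into a graceful close, here is a minimal sketch using only the standard library. It is not part of this PR (which does not implement signal handlers): `close_gracefully` below is a hypothetical stand-in for the worker's graceful-close action, and the process sends the signal to itself purely for demonstration. SIGKILL cannot be caught, so this approach only helps where a SIGTERM arrives first.

```python
import signal
import time

state = {"closing": False}


def close_gracefully():
    # Hypothetical stand-in for the graceful-close action discussed in this PR;
    # a real worker would hand off its data and tasks before exiting.
    state["closing"] = True


def handle_sigterm(signum, frame):
    close_gracefully()


# signal.signal must be called from the main thread.
previous = signal.signal(signal.SIGTERM, handle_sigterm)

# Simulate the job scheduler's warning by sending SIGTERM to ourselves.
signal.raise_signal(signal.SIGTERM)

# Give the interpreter a moment to run the Python-level handler.
deadline = time.monotonic() + 5
while not state["closing"] and time.monotonic() < deadline:
    time.sleep(0.01)

# Restore the earlier handler if there was one we can reinstall.
if previous is not None:
    signal.signal(signal.SIGTERM, previous)

print("closing:", state["closing"])
```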

lesteve · 2019-08-29T09:03:55Z

Good to know that other schedulers are a bit nicer regarding job termination.

I found this, which seems to indicate that for SGE, SIGKILL is used by default but that you can configure it to be something else.

mrocklin added 6 commits July 26, 2019 17:03

Add Worker.close_gracefully method

7d34285

Add lifetime keyword to Worker

7a163e0

Add lifetime to nanny and CLI

0b72585

Add --lifetime-stagger keyword

eec7754

Merge branch 'master' of github.com:dask/distributed into worker-clos…

fc607ef

…e-gracefully

mark test as slow

9124c18

TomAugspurger reviewed Jul 29, 2019

jni mentioned this pull request Jul 30, 2019

Handling workers with expiring allocation requests dask/dask-jobqueue#122

Closed

guillaumeeb approved these changes Jul 30, 2019

distributed/worker.py (outdated, resolved)

distributed/cli/dask_worker.py (outdated, resolved)

mrocklin and others added 4 commits July 30, 2019 07:02

Update distributed/worker.py [skip ci]

aa4c5e5

Co-Authored-By: Guillaume Eynard-Bontemps <g.eynard.bontemps@gmail.com>

remove warning in --lifetime help string

13455bd

Now that we've added --lifetime-restart this is more clear there

Merge branch 'worker-close-gracefully' of github.com:mrocklin/distrib…

dfb810c

…uted into worker-close-gracefully

Merge branch 'master' into worker-close-gracefully

2061fd4

mrocklin force-pushed the worker-close-gracefully branch from 87c3082 to 2061fd4 Compare July 30, 2019 22:02

Close a Client if it doesn't startup in time

3e7aa67

mrocklin force-pushed the worker-close-gracefully branch from e11eaed to 3e7aa67 Compare July 31, 2019 00:20

mrocklin added 2 commits July 30, 2019 18:39

Default lifetime_stagger to None so that we get the value from config

ff97b21

cleanup test_bad_tasks_fail

b5c9051

mrocklin merged commit 5f12043 into dask:master Jul 31, 2019

mrocklin deleted the worker-close-gracefully branch July 31, 2019 14:37

mrocklin mentioned this pull request Nov 19, 2019

RFC Retire worker with periodic callback #3248

Closed

wshanks mentioned this pull request Apr 24, 2020

Soft time limit for workers? dask/dask-jobqueue#416

Closed
