dask-worker not handling KeyboardInterrupt correctly #2788

Closed
TomAugspurger opened this issue Jun 19, 2019 · 4 comments

Comments

@TomAugspurger
Member

Connect a dask-worker to the scheduler and then press Ctrl-C.

The worker should exit cleanly. Instead, it exits with a TimeoutError:

2019-06-19 14:53:49,386 distributed.worker[54182] INFO -------------------------------------------------
2019-06-19 14:53:49,393 distributed.worker[54182] INFO         Registered to:    tcp://192.168.7.20:8786
2019-06-19 14:53:49,393 distributed.worker[54182] INFO -------------------------------------------------
2019-06-19 14:53:49,394 distributed.core[54182] INFO Starting established connection
^C2019-06-19 14:53:51,525 distributed.dask_worker[54155] INFO Exiting on signal 2
2019-06-19 14:53:51,526 distributed.nanny[54155] INFO Closing Nanny at 'tcp://192.168.7.20:62826'
2019-06-19 14:53:51,528 distributed.dask_worker[54155] INFO End worker
Traceback (most recent call last):
  File "/Users/taugspurger/.virtualenvs/dask-dev/bin/dask-worker", line 11, in <module>
    load_entry_point('distributed', 'console_scripts', 'dask-worker')()
  File "/Users/taugspurger/sandbox/distributed/distributed/cli/dask_worker.py", line 387, in go
    main()
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/Users/taugspurger/sandbox/distributed/distributed/cli/dask_worker.py", line 380, in main
    raise TimeoutError("Timed out starting worker.") from None
tornado.util.TimeoutError: Timed out starting worker.
2019-06-19 14:53:51,531 distributed.process[54155] WARNING reaping stray process <ForkServerProcess(Dask Worker process (from Nanny), started daemon)>

This is related to my PR. Will take a look later.

@TomAugspurger
Member Author

I looked into this a bit today.

I think there's a bug in

def on_signal(signum):

That should be a gen.coroutine; otherwise any yield within that call stack will immediately exit the handler. I tested this with a yield gen.sleep(0) in the close_all right above it: anything after the yield gen.sleep(0) isn't run.

But even after fixing that, I'm still seeing a Ctrl-C cause a TimeoutError. Will come back to this later.
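
The behaviour described above (nothing after the first yield runs) can be reproduced without Tornado. The following is only a stdlib analogy, not the code in dask_worker.py: the function names mirror the ones in this thread, the log argument is a hypothetical addition for illustration, and a bare yield stands in for yield gen.sleep(0).

```python
def close_all(log):
    """Generator standing in for a coroutine-style shutdown routine."""
    log.append("before yield")
    yield  # stands in for `yield gen.sleep(0)`
    log.append("after yield")


def on_signal_broken(log):
    # The handler is a plain function: it starts the generator, runs it up
    # to its first yield, and then returns without ever resuming it.
    step = close_all(log)
    next(step)


def on_signal_fixed(log):
    # Driving the generator to completion is roughly what a coroutine
    # runner does for a handler that is itself a coroutine.
    for _ in close_all(log):
        pass
```

Calling on_signal_broken with an empty list leaves it as ["before yield"], while on_signal_fixed also appends "after yield".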

@mrocklin
Member

Additionally, we might consider having SIGINT call something like the following in order to cleanly move data away:

worker.scheduler.close_workers(..., workers=[self.address])

cc @jcrist @jakirkham @andersy005 @Carreau

@Carreau
Contributor

Carreau commented Mar 9, 2020

According to the Slurm documentation, processes on a preemptible queue are sent SIGCONT, SIGTERM, and then SIGKILL, in that order. I'm guessing SIGCONT comes first because the processes might already be suspended. So maybe we want to trigger this on SIGTERM as well.
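
A minimal sketch of routing both SIGINT and SIGTERM through one graceful shutdown path, using only asyncio from the standard library. This is not how dask-worker is implemented: serve_until_signalled and the cleanup argument are hypothetical names, and cleanup stands in for whatever call moves data off the worker. loop.add_signal_handler is only available on Unix.

```python
import asyncio
import os
import signal


async def serve_until_signalled(cleanup, signals=(signal.SIGINT, signal.SIGTERM)):
    """Wait for one of `signals`, await `cleanup()`, and return the signal."""
    loop = asyncio.get_running_loop()
    stop = asyncio.Event()
    received = []

    def on_signal(signum):
        # Keep the handler trivial: record the signal and wake the waiter.
        # The actual shutdown work is awaited below, so it runs to completion.
        received.append(signum)
        stop.set()

    for sig in signals:
        loop.add_signal_handler(sig, on_signal, sig)
    try:
        await stop.wait()
        await cleanup()  # hypothetical hook, e.g. ask the scheduler to retire this worker
    finally:
        for sig in signals:
            loop.remove_signal_handler(sig)
    return received[0]
```

Because the handler only sets an event, the shutdown coroutine is awaited from ordinary coroutine code rather than started from inside the handler and abandoned.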

@jakirkham jakirkham removed their assignment Mar 9, 2020
@TomAugspurger
Member Author

FWIW, I don't see this any more.

bash-5.0$ dask-worker tcp://192.168.7.20:8786
distributed.nanny - INFO -         Start Nanny at: 'tcp://192.168.7.20:61655'
distributed.worker - INFO -       Start worker at:   tcp://192.168.7.20:61657
distributed.worker - INFO -          Listening to:   tcp://192.168.7.20:61657
distributed.worker - INFO -          dashboard at:         192.168.7.20:61656
distributed.worker - INFO - Waiting to connect to:    tcp://192.168.7.20:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          8
distributed.worker - INFO -                Memory:                   17.18 GB
distributed.worker - INFO -       Local Directory: /Users/taugspurger/sandbox/distributed/dask-worker-space/worker-r_2ftwxz
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:    tcp://192.168.7.20:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
^Cdistributed.nanny - INFO - Closing Nanny at 'tcp://192.168.7.20:61655'
distributed.dask_worker - INFO - End worker
bash-5.0$ echo $?
0
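
The manual check above (Ctrl-C, then echo $?) could be automated along these lines. This is a sketch under assumptions: exit_code_after_sigint and ready_marker are hypothetical names, the child is assumed to print a recognisable line on stdout once it is ready (a real dask-worker may log elsewhere), and CHILD is a small stand-in script rather than dask-worker itself. It relies on Unix signal semantics.

```python
import signal
import subprocess
import sys

# Stand-in child process: exits with status 0 when it receives SIGINT.
CHILD = r"""
import signal, sys, time
signal.signal(signal.SIGINT, lambda signum, frame: sys.exit(0))
print("ready", flush=True)
while True:
    time.sleep(0.1)
"""


def exit_code_after_sigint(argv, ready_marker="ready", timeout=10):
    """Start `argv`, wait for `ready_marker` on stdout, send SIGINT, return the exit code."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE, text=True)
    try:
        for line in proc.stdout:
            if ready_marker in line:
                break
        proc.send_signal(signal.SIGINT)
        return proc.wait(timeout=timeout)
    finally:
        if proc.poll() is None:
            # The child did not exit in time; do not leave it running.
            proc.kill()
            proc.wait()
        proc.stdout.close()
```

For example, exit_code_after_sigint([sys.executable, "-c", CHILD]) runs the stand-in child; a clean shutdown corresponds to an exit code of 0, matching the echo $? output above.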

Closing.
