
Workers only connect to scheduler when cluster started in IPython #293

Closed
salotz opened this issue Jul 20, 2019 · 3 comments

Comments

@salotz

salotz commented Jul 20, 2019

Perhaps I am missing something obvious here, but I wrote a little script to get something up and running:


if __name__ == "__main__":

    import sys
    from dask_jobqueue import SLURMCluster

    num_workers = int(sys.argv[1])

    cluster = SLURMCluster(project='dicksonlab',
                           cores=1,
                           walltime="00:05:00",
                           memory='3 GB',
                           processes=1,
                           interface='ib0')
    cluster.scale(num_workers)

    print(cluster.address)

If I execute this from an IPython session (like in every demo I've seen) everything is okay and the logs of my worker jobs show that they have connected.

However, if I just execute this script (I also tried without the __name__ guard), then it all starts and runs (and, suspiciously, returns the prompt), but the workers never connect and eventually time out.

distributed.worker - INFO - Waiting to connect to:      tcp://10.3.8.48:38990

After looking at the source, I noticed the remarks mentioning that this is a planned feature: https://github.com/dask/dask-jobqueue/blob/master/dask_jobqueue/deploy/cluster_manager.py#L54

I still think this is a noteworthy consequence of that problem, as it points out that the tool is really tied to IPython and/or Jupyter notebooks, which I don't really use. At least a warning in the docs would help for now, unless someone has a workaround.

Cheers, and thanks for all the hard work on this; it really makes my life with SLURM et al. much easier.

~Sam

@willirath
Collaborator

willirath commented Jul 21, 2019 via email

@lesteve
Member

lesteve commented Jul 22, 2019

If you want to wait for workers to be available, you can use client.wait_for_workers(n=num_workers) (available since distributed >= 2), or do it by hand following
dask/distributed#2138 (comment), for example.
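
As a rough sketch of that first approach (not the original script; the SLURMCluster arguments here are just placeholders), assuming distributed >= 2:

# Sketch: block until the requested number of workers has registered with the scheduler.
# The SLURMCluster arguments below are illustrative placeholders, not a recommendation.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

num_workers = 4

cluster = SLURMCluster(cores=1, memory="3 GB", walltime="00:05:00")
cluster.scale(num_workers)

client = Client(cluster)
client.wait_for_workers(num_workers)  # blocks until the workers have connected
print(client.scheduler_info()["workers"])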

More generally, when you run a script using dask you need something that blocks until you get the result you need (e.g. using .result or client.gather). Otherwise, since the dask scheduler lives in your main process, when your script finishes (generally more quickly than you expect), the dask scheduler disappears and your dask workers will kill themselves after death-timeout because they are unable to connect to the scheduler.
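
As a rough illustration of that pattern (again with placeholder SLURMCluster arguments, not the original script), a script that blocks on client.gather might look like:

# Sketch: the call to client.gather blocks until all results are back, keeping the
# main process (and therefore the scheduler) alive while the workers do their work.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

def square(x):
    return x ** 2

if __name__ == "__main__":
    cluster = SLURMCluster(cores=1, memory="3 GB", walltime="00:05:00")
    cluster.scale(2)
    client = Client(cluster)

    futures = client.map(square, range(10))
    results = client.gather(futures)  # blocks until all tasks have completed
    print(results)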

To be honest, this is a common caveat when you start with dask and run Python scripts. If you see a good way to add this to the documentation (probably in the distributed project), you are more than encouraged to do so!

For the record, dask and its subprojects are not tied to IPython or Jupyter notebooks.

I am going to close the issue. @salotz feel free to comment if you feel your question has not been answered to its full extent.

@lesteve lesteve closed this as completed Jul 22, 2019
@salotz
Author

salotz commented Jul 23, 2019

Thanks for the suggestions! One of these will work for me.

@lesteve lesteve mentioned this issue Jul 25, 2019