
Workers only connect to scheduler when cluster started in IPython #293

Closed
salotz opened this issue Jul 20, 2019 · 3 comments

Comments

@salotz

salotz commented Jul 20, 2019

Perhaps I am missing something obvious here, but I wrote a little script to get something up and running:


if __name__ == "__main__":

    import sys
    from dask_jobqueue import SLURMCluster

    num_workers = int(sys.argv[1])

    cluster = SLURMCluster(project='dicksonlab',
                           cores=1,
                           walltime="00:05:00",
                           memory='3 GB',
                           processes=1,
                           interface='ib0')
    cluster.scale(num_workers)

    print(cluster.address)

If I execute this from an IPython session (like in every demo I've seen) everything is okay and the logs of my worker jobs show that they have connected.

However, if I just execute this script (I also tried without the __name__ guard), then it all starts and runs (and, suspiciously, returns the prompt), but the workers never connect and eventually time out.

distributed.worker - INFO - Waiting to connect to:      tcp://10.3.8.48:38990

After looking at the source, I noticed the remarks mentioning that this is a planned feature: https://github.com/dask/dask-jobqueue/blob/master/dask_jobqueue/deploy/cluster_manager.py#L54

I still think this is a noteworthy consequence of that problem, as it points out that the tool is really tied to IPython and/or Jupyter notebooks, which I don't really use. At least a warning in the docs would help for now, unless someone has a workaround.

Cheers, and thanks for all the hard work on this; it really makes my life with SLURM et al. much easier.

~Sam

@willirath
Collaborator

willirath commented Jul 21, 2019 via email

@lesteve
Member

lesteve commented Jul 22, 2019

If you want to wait for workers to be available, you can use client.wait_for_workers(n=num_workers) (available since distributed >= 2), or do it by hand following
dask/distributed#2138 (comment), for example.
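
As a rough sketch of that first approach (not the original script; the SLURMCluster arguments here are just placeholders), assuming distributed >= 2:

# Sketch: block until the requested number of workers has registered with the scheduler.
# The SLURMCluster arguments below are illustrative placeholders, not a recommendation.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

num_workers = 4

cluster = SLURMCluster(cores=1, memory="3 GB", walltime="00:05:00")
cluster.scale(num_workers)

client = Client(cluster)
client.wait_for_workers(num_workers)  # blocks until the workers have connected
print(client.scheduler_info()["workers"])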

More generally, when you run a script using dask you need something that blocks until you get the result you need (e.g. using .result or client.gather). Otherwise, since the dask scheduler lives in your main process, when your script finishes (generally more quickly than you expect), the dask scheduler disappears and your dask workers will kill themselves after death-timeout because they are unable to connect to the scheduler.
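
As a rough illustration of that pattern (again with placeholder SLURMCluster arguments, not the original script), a script that blocks on client.gather might look like:

# Sketch: the call to client.gather blocks until all results are back, keeping the
# main process (and therefore the scheduler) alive while the workers do their work.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

def square(x):
    return x ** 2

if __name__ == "__main__":
    cluster = SLURMCluster(cores=1, memory="3 GB", walltime="00:05:00")
    cluster.scale(2)
    client = Client(cluster)

    futures = client.map(square, range(10))
    results = client.gather(futures)  # blocks until all tasks have completed
    print(results)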

To be honest, this is a common caveat when you start with dask and run Python scripts. If you see a good way to add this to the documentation (probably in the distributed project), you are more than encouraged to do so!

For the record, dask and its subprojects are not tied to IPython or Jupyter notebooks.

I am going to close the issue. @salotz feel free to comment if you feel your question has not been answered to its full extent.

@lesteve lesteve closed this as completed Jul 22, 2019
@salotz
Author

salotz commented Jul 23, 2019

Thanks for the suggestions! One of these will work for me.

@lesteve lesteve mentioned this issue Jul 25, 2019