Wait for workers to join before continuing #2138
Hi there. From the dask-jobqueue perspective, I don't think it would be a good thing to wait until all workers join. Maybe block until one worker is there, but with a job scheduler, waiting for all of them could take hours, if not more. Maybe add a kwarg to scale()? From what I remember, using the plugin as suggested by @mrocklin looks simple enough.
You make a good point. I'm making a few assumptions here. The first is that I'm using a cluster which doesn't make me wait very long to give me workers (the cloud). I'm talking about interactive jobs rather than batch jobs. I'm also assuming that I'm doing something brittle which needs all the workers before starting. The primary example of this for me right now is doing a live demo; I don't want it to run on one worker, as it will be underwhelming. I want to scale and wait until I have a nice big cluster before plowing through my graph. I appreciate that might not be a good enough reason to do this. I could write a little script to block until the worker count reaches n for live demos.
I think it would make sense to have an explicit
Perhaps we could have
Yep, I had something like this in mind. Being able to wait for a user-defined number of workers to be online is something that would definitely be useful!
FWIW, code like this has served this purpose quite well for us. It would be easy enough for someone to add it if they wish.

```python
from time import sleep

while (client.status == "running") and (len(client.scheduler_info()["workers"]) < nworkers):
    sleep(1.0)
```
Could we use a non-active-wait solution for that? Maybe using http://distributed.dask.org/en/latest/plugins.html#scheduler-plugins and some callback? Do you have a suggestion on this, @mrocklin?
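For illustration, a minimal sketch of such a plugin-based approach might look like the following (the `WaitForWorkers` class, its `callback` argument, and the worker count threshold are assumptions for this sketch, not an existing distributed API):

```python
from distributed.diagnostics.plugin import SchedulerPlugin


class WaitForWorkers(SchedulerPlugin):
    """Hypothetical plugin that fires a callback once n workers have joined."""

    def __init__(self, n, callback):
        self.n = n
        self.callback = callback
        self.fired = False

    def add_worker(self, scheduler=None, worker=None, **kwargs):
        # The scheduler calls this hook each time a new worker registers
        if not self.fired and len(scheduler.workers) >= self.n:
            self.fired = True
            self.callback()
```

The plugin could then be registered on the scheduler (for example via `scheduler.add_plugin(...)`), which would avoid the active polling loop shown above.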
Adding some constraints here: Currently the We might solve this in a couple of ways:
Just for my understanding, if we do not modify
In case someone wanted to use `wait` from within an asynchronous environment:

```python
async with SGECluster(...) as cluster:
    cluster.scale(10)
    await cluster.wait()
    x.compute()
```
As another data point, this is impacting my use of Dask in an HPC environment (and some thoughts on CLI integration below). My jobs are started by a job scheduler (SLURM), and at the start of the job I want to spin up a scheduler and a set of workers. For reasons not worth mentioning here, I'm doing this by calling

There is a race between the three steps. This is not a problem for correctness, since the worker knows to retry if the scheduler isn't up yet, and the scheduler knows to wait for at least one worker to arrive before allowing the client to start running. But in my case this is a performance issue, because I want to do accurate timing of execution, and if the client starts before all workers have arrived then the run isn't reflective of the steady-state performance. Currently I'm doing a fairly awful workaround based on parsing the log files from the scheduler and counting the number of times it reports workers. You can see my script below. I'm not sure how exactly the API mentioned in the original post would integrate with my workflow; presumably this would need to be exposed via the CLI somehow. I would prefer not to have to write custom scripts to start Dask, but I suppose I can do that if absolutely necessary.
Could you create the client in the script, and then poll
I think I could make that work. In my case I know how many nodes I'm planning to boot, so I can just tell that to the client. Is there API documentation for this? I was trying to figure out if e.g. the
https://distributed.readthedocs.io/en/latest/api.html#distributed.Client.ncores
The method just gives information on the scheduler's view of the cluster.
You can also look at Client.scheduler_info, which is a decent all-purpose method to get information about the general state.
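As a rough sketch of how a launch script might use this (the scheduler address and expected worker count below are placeholders, not values from this thread):

```python
from time import sleep

from distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder address
expected = 16  # assumed: known from the size of the SLURM allocation

# client.ncores() returns one entry per connected worker, so its length
# reflects the scheduler's current view of the cluster
while len(client.ncores()) < expected:
    sleep(1.0)
```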
But we should probably solve this problem properly at some point. It's well in scope for the project. If you or anyone you know would like to work on it, we'd be happy to help walk them through it. Otherwise, hopefully it gets implemented within a few months.
@elliottslaughter I guess you are using dask-jobqueue? Have you seen this PR: dask/dask-jobqueue#223 by @danpf? It is not properly finished, but it may give you some insights.
Moved from dask/dask-kubernetes#87
It would be useful to have the `cluster.scale()` method block until the workers join. This was originally raised in dask-kubernetes, but it is applicable to dask-jobqueue and dask-yarn, so it would make more sense to implement here. Suggestions from @mrocklin:
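Purely as an illustration of the requested behaviour (this is a sketch, not the suggestions referenced above; neither a `wait=` keyword on `scale` nor a `cluster.wait()` method existed at the time), usage might look like:

```python
# Hypothetical option 1: a keyword argument that makes scale block
cluster.scale(10, wait=True)

# Hypothetical option 2: an explicit blocking call after scaling
cluster.scale(10)
cluster.wait(n=10, timeout=600)  # block until 10 workers join, or time out
```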