Make SLURMCluster.scale() request multiple nodes at once #459
I don't see an easy path towards fully supported multi-node jobs in dask-jobqueue, as it's designed to serve as a common (or at least very similar) interface to many different batch scheduling systems. I'm not familiar with many of the supported batch scheduling systems, but none I know of has something as flexible as SLURM's srun here. There might, however, be a simple workaround that could help with most of your scenarios if you don't care about fine-grained scaling.
Note that [...] (cc: @kathoef, who might be interested in giving this a try as well.)
Also, for Dask clusters of static size, dask-mpi could be the right solution.
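For reference, here is a minimal dask-mpi sketch (my illustration, not code from this thread; it assumes dask-mpi and an MPI implementation are installed, and the process count is arbitrary):

```python
# Illustrative sketch: run with e.g. `mpirun -n 12 python launch_cluster.py`.
from dask_mpi import initialize
from dask.distributed import Client

# Rank 0 becomes the scheduler, rank 1 continues running this client code,
# and the remaining ranks become workers.
initialize()

client = Client()  # connects to the scheduler started by initialize()
```

This only covers statically sized clusters, which is exactly the trade-off mentioned above.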
Just tried this. The hard-coded [...]
Hi @demaheim, see also #196. It is about job arrays, but I think it shares your concern and gives some explanation of why this is not implemented in dask-jobqueue. TL;DR: the current approach is to find workarounds rather than complicate the dask-jobqueue code. So if you find one for Slurm, it would be very welcome!
Hey @willirath, could you try to use [...]?
Hi @willirath and @guillaumeeb,

[...]

but there are still only single-node requests:

[...]

Later I will try [...]
I think dask-jobqueue/dask_jobqueue/core.py, line 284 at d7afbfe, [...]. The following could do:

[...]
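The suggested snippet itself was not captured above. Purely as an illustration of the kind of header tweak being discussed, and assuming the job_extra keyword that SLURMCluster accepted at the time (renamed job_extra_directives in later releases), it could look roughly like this; the node count and resources are made up:

```python
from dask_jobqueue import SLURMCluster

# Illustration only: inject an extra header line so Slurm allocates
# 9 nodes in a single job instead of one node per job.
cluster = SLURMCluster(
    cores=4,
    memory="16GB",
    job_extra=["--nodes=9"],  # assumption: rendered as "#SBATCH --nodes=9"
)
```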
You can check the job script with print(cluster.job_script()) and make sure the header looks fine.
When I use [...] it looks good (requesting 9 nodes at once). cluster.job_script():

[...]

But when I use numbers > 9, e.g. 10, [...] only one node is requested. cluster.job_script():

[...]

(Sorry, I need to log off the cluster now; I will try again tomorrow.)
This looks like expected (but somewhat buggy, see #461) behaviour. The job script with 9 nodes looks promising, but you'll need to use python="srun -N 9 -n 9 /home/dheim/miniconda3/bin/python" to make sure you're really getting 9 tasks, each running a worker.
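To make that concrete, here is a sketch of where this python argument goes (the srun command is the one quoted above; cores and memory are placeholders, and the header is assumed to already request 9 nodes as discussed earlier):

```python
from dask_jobqueue import SLURMCluster

# Illustrative sketch: launch the worker process through srun so that each
# of the 9 allocated nodes runs one task.
cluster = SLURMCluster(
    cores=4,
    memory="16GB",
    python="srun -N 9 -n 9 /home/dheim/miniconda3/bin/python",
)
```

Printing cluster.job_script() afterwards should show the srun prefix in the worker command line.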
The dirty workaround first looks good:

[...]

leads to

[...]

but

[...]

seems to result in many errors:

[...]
There are a few potential issues that could be triggering this. One is that (to my understanding) there is likely no guarantee from the launcher ([...]) that [...]. The other is that even if they were available, your 9 workers would all share the same [...].

With the current approach, what you are really looking for is to use dask-mpi (without a scheduler) from within dask-jobqueue... it might not be so hard (especially if you are already willing to hack the headers), but I imagine it is uncharted territory.

I don't have a general solution for this; the job-array discussion in #196 would probably serve your needs, but that might be some time away.
Thank you for all your help! I will try [...]
I think there is a workaround for this with #480. Closing this one.
Our cluster prefers multiple-node requests over single-node requests. The problem is that SLURMCluster.scale() makes a separate single-node request for each worker, so scaling to 11 workers results in 11 single requests.

Our current workaround to speed up scaling is to start the cluster via the following script:

[...]

In this case, the 11 nodes are requested at once, which is much faster than making 11 single requests. How can SLURMCluster.scale() be adapted to request all 11 nodes at once?