Can dask-jobqueue use multi-node jobs (Was: Creating dask-jobqueue cluster from a single job versus multiple jobs)? #364
Comments
Hi @wtbarnes, thanks for the question!
Nope, it does not. There has been some discussion about this in the past. Dask-jobqueue is quite simple and does not handle multi-node jobs. As you say, it is an anti-pattern. In order to do this, you probably want to look at Dask-mpi. For your remaining questions: the dask-worker is a process which will run on only one node, the first one from your reservation. The other nodes you ask for from PBS will do nothing, remaining idle. Your worker is probably killed because it exceeds the resources available on its node. Another thing you may try is to stick with dask-jobqueue, but submit smaller jobs.
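As an illustration (not the example from the original comment), here is a minimal sketch of what "smaller jobs" could look like with PBSCluster, where each job fits on a single node; the queue name, walltime, and worker split are placeholders, not values from this thread:

```python
# Hypothetical sketch: many small, single-node jobs instead of one large
# multi-node job. Queue name, walltime, and processes are placeholders.
from dask_jobqueue import PBSCluster

cluster = PBSCluster(
    cores=24,             # one node's worth of cores per job
    memory="128GB",       # one node's worth of memory per job
    processes=4,          # split each node into 4 worker processes
    queue="normal",       # placeholder queue name
    walltime="02:00:00",  # placeholder walltime
)
cluster.scale(jobs=10)    # recent dask-jobqueue: ask for 10 single-node jobs
```

Each of these jobs can start as soon as any single node frees up, which generally queues faster than one large multi-node reservation.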
@guillaumeeb Thanks for the detailed reply. This is what I suspected, but wanted to make sure my intuition was correct! I'll go ahead and close this issue, but may add comments following further discussion with the Pleiades folks. Thanks for the Dask-mpi suggestion as well! This may be a good solution for their proposed single-job, high-availability queue.
@guillaumeeb I have to correct something I said previously:
Rather, for the single-job, multi-node example that I gave above, this would be … Does this overcome the multi-node limitation of dask-jobqueue? Or will work still only be allocated over a single node?
I think the answer is the same: dask-jobqueue does not know how to use multi-node jobs. It seems like you understand your job scheduler well, so to understand what dask-jobqueue is doing in terms of jobs:

Full disclosure: the scheduler I know the most about is SGE, which does not have the multi-node jobs feature, so I don't really understand how multi-node jobs are used in practice with SLURM or other schedulers.

Side-comment: with the 0.7 release you can use
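For reference, a rough sketch (my own illustration, not the list from the original comment) of how dask-jobqueue translates scaling requests into scheduler jobs; the parameter values below are arbitrary placeholders:

```python
# My own illustration: each generated job script requests a single node and
# starts `processes` worker processes, so asking for N workers submits
# ceil(N / processes) separate jobs.
from math import ceil
from dask_jobqueue import PBSCluster

cluster = PBSCluster(cores=24, memory="128GB", processes=4)
print(cluster.job_script())   # the qsub script for ONE job (one node, 4 workers)

n_workers = 10
cluster.scale(n_workers)      # counts workers, not jobs
print(ceil(n_workers / 4))    # number of jobs actually submitted -> 3
```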
Now I'm confused. My jobqueue.yaml is as follows:

# Dask worker options
Then in Python I do:
#PBS -N dask-worker
/home7/jcbecker/.conda/envs/geo/bin/python -m distributed.cli.dask_worker tcp://10.150.27.18:43961 --nthreads 1 --nprocs 200 --memory-limit 10.00GB --name dask-worker--${JOB_ID}-- --death-timeout 60 --local-directory /nobackup/jcbecker/dask --interface ib0

So PBS gave me 25 nodes on which to run 200 processes. Isn't that how many workers I get according to cluster.job_script()? Please enlighten me.
To answer the questions from @wtbarnes and @jeffcbecker:
Yes, PBS gives you 25 nodes. Some batch schedulers such as Slurm propose other abstractions to spawn processes on any node of the reservation. PBS has pbsdsh, but nothing like that is happening here.
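As a side note (my own illustration, not part of the original reply), one way to confirm this empirically is to ask the scheduler which hosts the workers actually ended up on; the cluster parameters here are placeholders:

```python
# Sketch: list the distinct hosts running workers. With dask-jobqueue inside a
# multi-node PBS reservation, this set contains only the first node.
from dask.distributed import Client
from dask_jobqueue import PBSCluster

cluster = PBSCluster(cores=24, memory="128GB")  # placeholder values
cluster.scale(1)
client = Client(cluster)
client.wait_for_workers(1)
worker_hosts = {info["host"] for info in client.scheduler_info()["workers"].values()}
print(worker_hosts)
```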
Thank you for the clarification. I changed my jobqueue.yaml to have the following:
And checking with qstat indicates that each job only requested 8 CPUs, not the 24 I specified. Isn't this wrong? Note that each node is dual-socket with 12 cores/socket, so why did Dask change my request from 24 CPUs to 8?
Not sure what changed, but it's working correctly now: each job requests 24 CPUs (cores).
@guillaumeeb @lesteve Thanks for all of your help on this. I think we have a clearer picture of how to proceed with our cluster configuration on Pleiades. I'm going to close this (again!) since we seem to have resolved our main issue, but will reopen if we run into more problems.
FYI I changed the title to reflect the discussion. Feel free to edit it or suggest a better title!
This is not strictly an "issue" but more a question about suggested usage, so if this question belongs somewhere else, please direct me there!
I've been working closely with admins of the NASA Pleiades HPC system on how best to support interactive Dask workflows on that system. Pleiades uses PBS. Thus far, my workflow has been to configure a cluster in which a single worker uses all available cores and memory on a single node. For example, for a machine that has 12 cores and 48 GB of memory per node, my jobqueue config is the following:
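For concreteness, a hedged sketch of what such a configuration looks like when expressed as PBSCluster keyword arguments (the equivalent YAML keys use the same names); the queue, walltime, and scratch directory below are placeholders, not the actual Pleiades values:

```python
# "One worker owns a whole node" configuration: each job requests a full node
# and starts a single worker process that uses all of its cores and memory.
from dask_jobqueue import PBSCluster

cluster = PBSCluster(
    cores=12,                   # all cores on one node
    memory="48GB",              # all memory on one node
    processes=1,                # one worker process per job/node
    queue="normal",             # placeholder queue name
    walltime="02:00:00",        # placeholder walltime
    local_directory="$TMPDIR",  # placeholder scratch location
)
```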
I then request 10-20 nodes by starting an equivalent number of jobs, e.g. by running cluster.scale(10). A lot of the time, this configuration works well, but one of the problems that has cropped up (and is common to many systems) is resource availability: when the cluster is busy, I may have to wait > 30 minutes for even a single job to start, which is not so ideal for interactive workflows! The entire Pleiades system is in very high demand, so this is often an issue, particularly on the newer, faster processors.

After raising this issue with the HPC staff, one suggested solution was to use a high-availability queue (called "devel") that allows users to submit only a single job at a time, but has very high availability (i.e. short waiting times). In this case, the suggested pattern would be to submit a single job that requests multiple cores on multiple nodes. For 25 nodes, each with 24 cores and 128 GB of memory, the jobqueue config is:
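For reference, a hedged sketch of the kind of single-job, multi-node request being described here, using the numbers from this paragraph (the "devel" queue name comes from the text; the walltime is a placeholder). As the maintainers explain in the comments above, dask-jobqueue will still start all of the worker processes on the first node of such a reservation:

```python
# Single PBS job spanning 25 nodes via an explicit resource_spec. The queue name
# comes from the description above; the walltime is a placeholder. Note that
# dask-jobqueue only launches workers on the first node of the reservation.
from dask_jobqueue import PBSCluster

cluster = PBSCluster(
    queue="devel",
    cores=25 * 24,                                 # total cores across the reservation
    memory="3200GB",                               # 25 nodes x 128 GB
    resource_spec="select=25:ncpus=24:mem=128GB",  # the multi-node PBS request
    walltime="02:00:00",                           # placeholder walltime
)
```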
Then, the user would call cluster.scale(1) once and be done. This satisfies the one-job restriction of the high-availability queue, reduces wait times since all resources are requested up front, and is also more "friendly" to the scheduler since it does not involve submitting many jobs. This pattern of course does not permit any scaling up or down, but that is a separate issue.

This is quite different from the multi-job workflow I've used previously (and the one that seems to be recommended in the dask-jobqueue docs), and I'm trying to wrap my head around whether it makes sense. My main questions are:
I've experimented with the above single-job workflow (and minor variations on it) and have found that computations (which worked just fine in the multi-job context) will lock up and/or result in a killed worker. However, it is not entirely clear to me why this is happening.
I apologize for the lengthy post! I'm trying to get a sense of the optimal usage pattern here in the context of many different configuration options, and trying to wrap my head around how all of this actually works. Any advice would be extremely helpful!