Thanks for the comprehensive guide! I tried to follow your steps (on Snellius), but I'm not entirely sure if I understand it correctly.
When I submit the jobscript, I request one job on a quarter node (32 cores). This launches a JupyterLab session with a Dask scheduler (but no workers). Then, I click the scale button to get workers. I tried with 5. When I did this, I noticed in squeue that 5 new jobs were started, each on a separate node, presumably following the configuration in the Dask config file.
If I'm not mistaken, that means I've now requested (and am being charged for) 6 * 32 cores. The scheduler in particular seems wasteful. Wouldn't it be much more efficient if the scheduler used one core and the remaining 31 cores and memory of my original jobscript allocation were dedicated to workers?
Hi @Peter9192, thanks for testing and for pointing this out - what you describe is exactly what's happening.
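For context, each worker job is sized by the Dask Jobqueue settings, which typically live in a config file such as ~/.config/dask/jobqueue.yaml: every job submitted by scale() requests exactly those per-job resources, independent of the allocation that runs JupyterLab and the scheduler. A minimal sketch of what such a config might look like (the values are illustrative, not the actual Snellius defaults):

```yaml
# Hypothetical Dask Jobqueue configuration; adjust to the real Snellius settings.
jobqueue:
  slurm:
    cores: 32            # cores requested per worker job
    memory: 56GB         # memory requested per worker job (assumed value)
    processes: 1         # Dask worker processes per job
    walltime: '01:00:00'
```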
Part of the issue is probably related to the configuration of Snellius, which does not allow for smaller jobs. Spider does not have this constraint, so one can start a small job (one or two cores) to run the Jupyter server and the Dask scheduler and configure larger jobs for the workers. I guess this is because Snellius targets real HPC problems, while Spider is geared more towards high-throughput workloads.
Anyway, right now Dask Jobqueue, which is what we use to set up the Dask cluster on SLURM, does not seem to support starting an additional local worker to occupy the resources on the node where Jupyter and the Dask scheduler are running. You could still start a worker to occupy these resources "manually", e.g. by running something like the following shell command in a separate terminal window within JupyterLab (you could get <SCHEDULER_ADDRESS> from client.scheduler.address):
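Something along these lines, where the thread count and memory limit are example values for a 32-core quarter node and should be adjusted to whatever is left of your allocation:

```sh
# Start an extra Dask worker on the node that already hosts JupyterLab and the
# scheduler, so the otherwise idle cores and memory get used (example values).
dask-worker <SCHEDULER_ADDRESS> --nthreads 31 --memory-limit 52GB
```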