
New job for each worker? #19

Open

Peter9192 opened this issue Oct 27, 2022 · 1 comment

@Peter9192
Thanks for the comprehensive guide! I tried to follow your steps (on Snellius), but I'm not entirely sure if I understand it correctly.

When I submit the jobscript, I request one job on a quarter node (32 cores). This launches a JupyterLab session with a Dask scheduler (but no workers). Then I click the scale button to request workers; I tried with 5. When I did this, I noticed in squeue that 5 new jobs were started, each on a separate node, presumably following the configuration in the Dask config file.
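
For reference, here is a minimal sketch of what I think this workflow corresponds to on the Python side, assuming the cluster is created through Dask-Jobqueue's SLURMCluster (names and values are illustrative, not taken from the actual setup):

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# With no arguments, SLURMCluster picks up cores, memory, walltime, etc.
# from the Dask jobqueue configuration file.
cluster = SLURMCluster()
client = Client(cluster)

# The "scale" button in JupyterLab does the equivalent of:
cluster.scale(jobs=5)  # submits 5 worker jobs, each landing on its own node here
```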

If I'm not mistaken, that means I've now requested (and am being charged for) 6 * 32 = 192 cores. The scheduler in particular seems wasteful. Wouldn't it be much more efficient if the scheduler used just one core, and the remaining 31 cores and the memory of my original jobscript were dedicated to workers?

@fnattino
Contributor

fnattino commented Nov 4, 2022

Hi @Peter9192, thanks for testing and for pointing this out - what you describe is exactly what's happening.

Part of the issue is probably related to the configuration of Snellius, which does not allow for smaller jobs. Spider does not have this constraint, so there one can start a small job (one or two cores) to run the Jupyter server and the Dask scheduler, and configure larger jobs for the workers. I guess this is because Snellius targets real HPC problems, while Spider is geared more towards high-throughput workloads.
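
Just to illustrate what I mean (a sketch with made-up resource values, not Spider's actual settings): the Jupyter server and the scheduler would run in the small job, while the size of the worker jobs is controlled separately via the SLURMCluster arguments (or the corresponding entries in the Dask config file):

```python
from dask_jobqueue import SLURMCluster

# Worker jobs are sized independently of the (small) job that runs
# Jupyter and the scheduler; the values below are placeholders.
cluster = SLURMCluster(
    cores=32,             # cores per worker job
    memory="56GiB",       # memory per worker job
    walltime="01:00:00",  # walltime of each worker job
)
cluster.scale(jobs=5)
```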

Anyway, right now Dask Jobqueue, which is what we use to set up the Dask cluster on SLURM, does not seem to support creating an additional local worker to occupy the resources on the node where Jupyter and the Dask scheduler are running. You could still start a worker on these resources "manually", e.g. by running something like the following shell command in a separate terminal window within JupyterLab (you can get <SCHEDULER_ADDRESS> from client.scheduler.address):

/home/$USER/mambaforge/envs/jupyter_dask/bin/python -m distributed.cli.dask_worker <SCHEDULER_ADDRESS> --nthreads 32 --memory-limit 56GiB --name LocalWorker --nanny --death-timeout 600 --local-directory $TMPDIR

But I agree this is less than optimal.
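
For completeness, the address to substitute for <SCHEDULER_ADDRESS> can be printed from a notebook cell, with client being the dask.distributed.Client connected to the cluster (the address shown in the comment is just an example):

```python
# Scheduler address to paste into the dask_worker command above,
# e.g. something like tcp://10.0.0.1:8786 (illustrative).
print(client.scheduler.address)
```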

I have opened an issue about this; let's see whether there is interest in solving it within Dask Jobqueue: dask/dask-jobqueue#596
