I want to launch a cluster on Slurm such that, on each node, a LocalCUDACluster is launched to use the available GPUs on that node. My sample code looks as follows:
import os

import dask
from dask.distributed import Client
from dask_jobqueue import SLURMCluster
from dask_cuda import LocalCUDACluster
from numba import cuda  # needed for cuda.list_devices() below


def test():
    # return cuda.get_current_device().id
    return [i.id for i in cuda.list_devices()]


def test_numba_cuda():
    # Launch a LocalCUDACluster on whichever node runs this task.
    cluster = LocalCUDACluster()
    client = Client(cluster)
    return cluster.cuda_visible_devices


queue = "gpus"       # batch, gpus, develgpus, etc.
project = "deepacf"  # your project: zam, training19xx, etc.
port = 56755

cluster = SLURMCluster(
    n_workers=2,
    cores=1,
    processes=2,
    memory="5GB",
    shebang="#!/usr/bin/env bash",
    queue=queue,
    scheduler_options={"dashboard_address": ":" + str(port)},
    walltime="00:30:00",
    local_directory="/tmp",
    death_timeout="30m",
    log_directory=f'{os.environ["HOME"]}/dask_jobqueue_logs',
    interface="ib0",
    project=project,
    python="/p/home/jusers/elshambakey1/juwels/jupyter/kernels/dg_rr_analytics/bin/python",
    nanny=False,
    job_extra=['--gres gpu:4'],
)

client = Client(cluster)
ts = [dask.delayed(test_numba_cuda)()]
res = client.compute(ts)
res[0].result()
I had to set nanny=False because otherwise I get an error about daemonized tasks not being allowed to have children; I found a similar problem at dask/distributed#2142, so I set nanny=False. This works fine with n_workers=1 and processes=1, but when I set both n_workers=2 and processes=2, it fails with the following error:
distributed.dask_worker - ERROR - Failed to launch worker. You cannot use the --no-nanny argument when n_workers > 1
I wonder how to solve this problem.
PS: I also posted this question at https://stackoverflow.com/questions/73464860/launching-dask-cuda-localcudacluster-within-slurmcluster
Unfortunately, SLURMCluster isn't currently supported by Dask-CUDA. SLURMCluster creates distributed.Worker processes (or distributed.Nanny processes; I'm not familiar with its internals), so what you're attempting is to launch a cluster and then run LocalCUDACluster inside those Worker processes via dask.delayed, and that's definitely not going to work.
Anyway, this is a duplicate of #653, so I'll close this issue and we can continue the discussion there if needed. A workaround is provided in #653 (comment), which is how people normally get Dask-CUDA set up on Slurm; perhaps something similar will do the trick for you.
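Roughly, that kind of workaround amounts to a Slurm batch script that starts a dask-scheduler plus one dask-cuda-worker per node, with the client connecting through a scheduler file, rather than using SLURMCluster at all. The sketch below is untested and only illustrative of that pattern, not the exact script from #653 (comment): the partition, walltime and --gres values mirror the question, while the scheduler-file path, the sleep duration and my_client_script.py are placeholders you would adapt to your site.

#!/usr/bin/env bash
#SBATCH --job-name=dask-cuda
#SBATCH --partition=gpus
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:4
#SBATCH --time=00:30:00

# Placeholder path; the scheduler writes its address here and the
# workers and client read it back.
SCHEDULER_FILE="$HOME/dask_scheduler.json"

# Start the scheduler on the first node of the allocation (where this
# batch script runs), listening on the InfiniBand interface.
dask-scheduler --scheduler-file "$SCHEDULER_FILE" --interface ib0 &

sleep 10  # give the scheduler a moment to write the scheduler file

# Launch one dask-cuda-worker per node; each spawns one Dask worker per
# visible GPU on that node and registers with the scheduler.
srun --ntasks-per-node=1 dask-cuda-worker \
    --scheduler-file "$SCHEDULER_FILE" --interface ib0 &

# The client script (placeholder name) connects through the same file, e.g.
#   client = Client(scheduler_file=os.path.expanduser("~/dask_scheduler.json"))
python my_client_script.py

With this layout the GPU workers are created by dask-cuda-worker itself on each node, so there is no need for nanny=False or for launching LocalCUDACluster from inside a task.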