launching dask-cuda LocalCUDACluster within SLURMCluster #980

Closed
shambakey1 opened this issue Aug 23, 2022 · 1 comment

Comments

@shambakey1

Hi,

I want to launch a cluster on Slurm where, on each node, a LocalCUDACluster is launched to use the GPUs available on that node. My sample code looks as follows:

import os

import dask
from dask.distributed import Client
from dask_jobqueue import SLURMCluster
from dask_cuda import LocalCUDACluster
from numba import cuda  # needed for cuda.list_devices() below

def test():
	# Return the IDs of all CUDA devices visible to this process
	#return cuda.get_current_device().id
	return [i.id for i in cuda.list_devices()]

def test_numba_cuda():
	# Launch a LocalCUDACluster inside the Dask worker
	cluster = LocalCUDACluster()
	client = Client(cluster)
	return cluster.cuda_visible_devices
	
queue = "gpus"  #  batch, gpus, develgpus, etc.
project = "deepacf"  # your project: zam, training19xx, etc.
port = 56755

cluster = SLURMCluster(
     n_workers=2,
     cores=1,
     processes=2,
     memory="5GB",
     shebang="#!/usr/bin/env bash",
     queue=queue,
     scheduler_options={"dashboard_address": ":"+str(port)},
     walltime="00:30:00",
     local_directory="/tmp",
     death_timeout="30m",
     log_directory=f'{os.environ["HOME"]}/dask_jobqueue_logs',
     interface="ib0",
     project=project,
     python="/p/home/jusers/elshambakey1/juwels/jupyter/kernels/dg_rr_analytics/bin/python",
     nanny=False,
     job_extra=['--gres gpu:4']
)

client = Client(cluster)
ts = [dask.delayed(test_numba_cuda)()]
res = client.compute(ts)
res[0].result()

I had to set nanny=False because otherwise I received an error about daemonized tasks that cannot have children (a similar problem is described in dask/distributed#2142). That worked fine with n_workers=1 and processes=1, but when I set both n_workers=2 and processes=2, it fails with the following error:
distributed.dask_worker - ERROR - Failed to launch worker. You cannot use the --no-nanny argument when n_workers > 1

I wonder how to solve this problem.
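For context, the "daemonized tasks that cannot have children" error comes from a restriction in Python's multiprocessing module: Dask worker processes run as daemons by default, and a daemon process is not allowed to spawn child processes of its own. A minimal standalone illustration of that restriction (plain multiprocessing, not Dask code; function names are made up for the example):

```python
import multiprocessing as mp

def child():
    pass

def daemon_task(q):
    # Inside a daemonic process, starting a child raises AssertionError.
    try:
        p = mp.Process(target=child)
        p.start()
        p.join()
        q.put("child started")
    except AssertionError as e:
        q.put(str(e))

if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=daemon_task, args=(q,), daemon=True)
    p.start()
    p.join()
    print(q.get())  # "daemonic processes are not allowed to have children"
```

This is the same constraint Dask works around with its nanny processes, which is why disabling the nanny changes the behavior.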

PS: I also posted this question at https://stackoverflow.com/questions/73464860/launching-dask-cuda-localcudacluster-within-slurmcluster

@pentschev
Copy link
Member

Unfortunately, SLURMCluster isn't currently supported by Dask-CUDA. SLURMCluster creates distributed.Worker processes (or distributed.Nanny processes, I'm not familiar with its internals), so what you're attempting is to launch a cluster and then run LocalCUDACluster inside those worker processes via dask.delayed; that's definitely not going to work.

Anyway, this is a duplicate of #653, so I'll close this issue and we can continue the discussion there if needed. A workaround is provided in #653 (comment), which is how people normally set up Dask-CUDA on Slurm; perhaps something similar will do the trick for you.
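For readers landing here: the usual pattern is to start the Dask scheduler and dask-cuda-worker processes directly from a Slurm batch script, rather than nesting a LocalCUDACluster inside SLURMCluster workers. A rough sketch of that pattern (partition name, GPU count, and file paths are placeholders, not taken from #653):

```shell
#!/usr/bin/env bash
#SBATCH --partition=gpus
#SBATCH --nodes=2
#SBATCH --gres=gpu:4
#SBATCH --time=00:30:00

# Start the scheduler on one node; workers find it via the scheduler file.
srun --nodes=1 --ntasks=1 dask-scheduler \
    --scheduler-file "$HOME/scheduler.json" &
sleep 10

# One dask-cuda-worker launcher per node; each spawns a worker
# process per visible GPU on that node.
srun dask-cuda-worker --scheduler-file "$HOME/scheduler.json"
```

A client then connects with Client(scheduler_file=...) from a login node or another job step.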
