launching dask-cuda LocalCUDACluster within SLURMCluster #980

Closed
shambakey1 opened this issue Aug 23, 2022 · 1 comment

Comments

@shambakey1

Hi,

I want to launch a cluster on Slurm where, on each node, a LocalCUDACluster is launched to use the GPUs available on that node. My sample code looks as follows:

import os

import dask
from dask.distributed import Client
from dask_jobqueue import SLURMCluster
from dask_cuda import LocalCUDACluster
from numba import cuda  # needed for cuda.list_devices() below

def test():
	# Return the IDs of all CUDA devices visible to this process
	#return cuda.get_current_device().id
	return [i.id for i in cuda.list_devices()]

def test_numba_cuda():
	# Launch a LocalCUDACluster inside the Dask worker
	cluster = LocalCUDACluster()
	client = Client(cluster)
	return cluster.cuda_visible_devices
	
queue = "gpus"  #  batch, gpus, develgpus, etc.
project = "deepacf"  # your project: zam, training19xx, etc.
port = 56755

cluster = SLURMCluster(
     n_workers=2,
     cores=1,
     processes=2,
     memory="5GB",
     shebang="#!/usr/bin/env bash",
     queue=queue,
     scheduler_options={"dashboard_address": ":"+str(port)},
     walltime="00:30:00",
     local_directory="/tmp",
     death_timeout="30m",
     log_directory=f'{os.environ["HOME"]}/dask_jobqueue_logs',
     interface="ib0",
     project=project,
     python="/p/home/jusers/elshambakey1/juwels/jupyter/kernels/dg_rr_analytics/bin/python",
     nanny=False,
     job_extra=['--gres gpu:4']
)

client = Client(cluster)
ts = [dask.delayed(test_numba_cuda)()]
res = client.compute(ts)
res[0].result()

I had to set nanny=False because otherwise I received an error about daemonized tasks that cannot have children (a similar problem is described in dask/distributed#2142). That worked fine with n_workers=1 and processes=1, but when I set both n_workers=2 and processes=2, it fails with the following error:
distributed.dask_worker - ERROR - Failed to launch worker. You cannot use the --no-nanny argument when n_workers > 1

I wonder how to solve this problem.
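For context, the "daemonized tasks that cannot have children" error comes from a restriction in Python's multiprocessing module: Dask worker processes run as daemons by default, and a daemon process is not allowed to spawn child processes of its own. A minimal standalone illustration of that restriction (plain multiprocessing, not Dask code; function names are made up for the example):

```python
import multiprocessing as mp

def child():
    pass

def daemon_task(q):
    # Inside a daemonic process, starting a child raises AssertionError.
    try:
        p = mp.Process(target=child)
        p.start()
        p.join()
        q.put("child started")
    except AssertionError as e:
        q.put(str(e))

if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=daemon_task, args=(q,), daemon=True)
    p.start()
    p.join()
    print(q.get())  # "daemonic processes are not allowed to have children"
```

This is the same constraint Dask works around with its nanny processes, which is why disabling the nanny changes the behavior.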

PS: I also posted this question at https://stackoverflow.com/questions/73464860/launching-dask-cuda-localcudacluster-within-slurmcluster

@pentschev
Copy link
Member

Unfortunately, SLURMCluster isn't currently supported by Dask-CUDA. SLURMCluster creates distributed.Worker processes (or distributed.Nanny processes, I'm not familiar with its internals), so what you're attempting is to launch a cluster and then run LocalCUDACluster inside those worker processes via dask.delayed; that's definitely not going to work.

Anyway, this is a duplicate of #653, so I'll close this issue and we can continue the discussion there if needed. A workaround is provided in #653 (comment), which is how people normally set up Dask-CUDA on Slurm; perhaps something similar will do the trick for you.
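For readers landing here: the usual pattern is to start the Dask scheduler and dask-cuda-worker processes directly from a Slurm batch script, rather than nesting a LocalCUDACluster inside SLURMCluster workers. A rough sketch of that pattern (partition name, GPU count, and file paths are placeholders, not taken from #653):

```shell
#!/usr/bin/env bash
#SBATCH --partition=gpus
#SBATCH --nodes=2
#SBATCH --gres=gpu:4
#SBATCH --time=00:30:00

# Start the scheduler on one node; workers find it via the scheduler file.
srun --nodes=1 --ntasks=1 dask-scheduler \
    --scheduler-file "$HOME/scheduler.json" &
sleep 10

# One dask-cuda-worker launcher per node; each spawns a worker
# process per visible GPU on that node.
srun dask-cuda-worker --scheduler-file "$HOME/scheduler.json"
```

A client then connects with Client(scheduler_file=...) from a login node or another job step.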
