DASK Deployment using SLURM with GPUs #1381

Open
AquifersBSIM opened this issue Sep 6, 2024 · 8 comments

@AquifersBSIM
Describe the issue:
I am running into an issue deploying Dask with LocalCUDACluster() on an HPC. I am trying to run RandomForest, and the amount of data I am inputting exceeds the memory of a single GPU, so I am trying to use several GPUs to split the dataset. To start with, I ran the following example script (taken from the Dask GitHub front page); a rough sketch of the RandomForest use case I am aiming for follows it.

Minimal Complete Verifiable Example:

import glob

def main():

    # Read CSV file in parallel across workers
    import dask_cudf
    df = dask_cudf.read_csv(glob.glob("*.csv"))

    # Fit a NearestNeighbors model and query it
    from cuml.dask.neighbors import NearestNeighbors
    nn = NearestNeighbors(n_neighbors = 10, client=client)
    nn.fit(df)
    neighbors = nn.kneighbors(df)

if __name__ == "__main__":

    # Initialize UCX for high-speed transport of CUDA arrays
    from dask_cuda import LocalCUDACluster

    # Create a Dask single-node CUDA cluster w/ one worker per device
    cluster = LocalCUDACluster()
    
    from dask.distributed import Client
    client = Client(cluster)

    main()
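
For context, the multi-GPU RandomForest I am ultimately aiming for would look roughly like this (a sketch only, assuming cuml's Dask RandomForestClassifier; the "label" column name, file pattern, and hyperparameters are placeholders):

import glob

import dask_cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from cuml.dask.ensemble import RandomForestClassifier

if __name__ == "__main__":
    # One worker per GPU visible on this node
    cluster = LocalCUDACluster()
    client = Client(cluster)

    # Partition the CSV data across the GPU workers
    df = dask_cudf.read_csv(glob.glob("*.csv"))
    X = df.drop(columns=["label"]).astype("float32")
    y = df["label"].astype("int32")

    # Training is distributed across all GPU workers
    rf = RandomForestClassifier(n_estimators=100)
    rf.fit(X, y)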

In addition to that, I have this submission script:

#!/bin/bash
#
#SBATCH --job-name=dask_examples
#SBATCH --output=dask_examples.txt
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --ntasks-per-node=1
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=5G
#SBATCH --gres=gpu:4
ml conda
conda activate /fred/oz241/BSIM/conda_SVM/SVM
/usr/bin/time -v python 1.py
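
For completeness, a quick check (a sketch using pynvml, which is already installed in the environment) that could be added before starting the cluster to confirm how many GPUs the job actually sees:

import pynvml

# Count the GPUs visible to this Slurm allocation
pynvml.nvmlInit()
print("visible GPUs:", pynvml.nvmlDeviceGetCount())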

Error Message

Task exception was never retrieved
future: <Task finished name='Task-543' coro=<_wrap_awaitable() done, defined at /fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/deploy/spec.py:124> exception=RuntimeError('Worker failed to start.')>
Traceback (most recent call last):
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 523, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/utils.py", line 1952, in wait_for
    return await fut
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/worker.py", line 1474, in start_unsafe
    raise plugins_exceptions[0]
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/utils.py", line 837, in wrapper
    return await func(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/worker.py", line 1876, in plugin_add
    result = plugin.setup(worker=self)
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/dask_cuda/plugins.py", line 14, in setup
    os.sched_setaffinity(0, self.cores)
    ^^^^^^^^^^^^^^^^^
OSError: [Errno 22] Invalid argument

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/deploy/spec.py", line 125, in _wrap_awaitable
    return await aw
           ^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 512, in start
    raise self.__startup_exc
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 523, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/utils.py", line 1952, in wait_for
    return await fut
           ^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 369, in start_unsafe
    response = await self.instantiate()
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 452, in instantiate
    result = await self.process.start()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 759, in start
    msg = await self._wait_until_connected(uid)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 901, in _wait_until_connected
    raise msg["exception"]
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 965, in run
    async with worker:
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 537, in __aenter__
    await self
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 531, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
    ^^^^^^^^^^^^^^^^^
RuntimeError: Worker failed to start.
Task exception was never retrieved
future: <Task finished name='Task-541' coro=<_wrap_awaitable() done, defined at /fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/deploy/spec.py:124> exception=RuntimeError('Worker failed to start.')>
(The same traceback as above repeats for this worker.)

Anything else we need to know?:
The traceback was pretty long; I have only included a snippet of it.

Environment:

  • Dask version: 2024.7.1
  • dask-jobqueue: 0.9.0
  • Python version: 3.11.9
  • Operating System: Linux (Slurm HPC)
  • Install method (conda, pip, source): conda
@pentschev
Member

Could you please report the output of nvidia-smi topo -m and the output of the script below? Please make sure both are run on a Slurm node where you experienced the original failure reported above.

print_affinity.py
import pynvml
from dask_cuda.utils import get_cpu_affinity


pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    cpu_affinity = get_cpu_affinity(i)
    print(type(cpu_affinity), cpu_affinity)

@AquifersBSIM
Author

Hi @pentschev, I forgot to mention that I have disabled the os.sched_setaffinity(0, self.cores) call, as shown below:

# Edited dask_cuda/plugins.py with the affinity call disabled
from distributed import WorkerPlugin


class CPUAffinity(WorkerPlugin):
    def __init__(self, cores):
        self.cores = cores

    def setup(self, worker=None):
        pass
        # os.sched_setaffinity(0, self.cores)

@pentschev
Member

Keep in mind doing that will likely result in degraded performance. Here's a previous comment I wrote about this on a similar issue.
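
For reference, what that plugin normally does on worker startup is pin the worker process to the CPU cores closest to its GPU, roughly along these lines (a sketch; the core list is illustrative, dask-cuda derives the real one from the GPU topology via NVML):

import os

# Pin the current process (pid 0 = self) to a set of CPU cores.
# If the set is empty, or contains cores the job is not allowed to use,
# this raises OSError: [Errno 22] Invalid argument, matching the failure above.
os.sched_setaffinity(0, {0, 1, 2, 3})
print(sorted(os.sched_getaffinity(0)))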

@AquifersBSIM
Author

AquifersBSIM commented Sep 11, 2024

Thank you @pentschev for the reply about me disabling os.sched_setaffinity. I will probably need some time to report the output of nvidia-smi topo -m.

Regarding print_affinity.py:
Do I have to re-enable os.sched_setaffinity for it to work?

@AquifersBSIM
Author

AquifersBSIM commented Sep 12, 2024

Hi @pentschev, here are the outputs of nvidia-smi topo -m and print_affinity.py. For your information, I have not re-enabled os.sched_setaffinity.

nvidia-smi topo -m output

        GPU0    GPU1    GPU2    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV4     NV4     SYS             3               N/A
GPU1    NV4      X      NV4     SYS             1               N/A
GPU2    NV4     NV4      X      NODE            5               N/A
NIC0    SYS     SYS     NODE     X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0

print_affinity.py output

<class 'list'> []
<class 'list'> []
<class 'list'> []

@pentschev
Member

@AquifersBSIM can you clarify what you mean by "I have not enabled the os.sched_setaffinity"? Do you mean that when you ran the above you had the line commented out as in your previous #1381 (comment)? If so, that doesn't really matter for the experiment above.

In any case, that unfortunately didn't really clarify whether the failure was in obtaining the CPU affinity or whether something else happened. Could you please run the following modified version of the script on the compute node?

print_affinity2.py
import math
from multiprocessing import cpu_count

import pynvml


pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    cpu_affinity = pynvml.nvmlDeviceGetCpuAffinity(handle, math.ceil(cpu_count() / 64))
    print(list(cpu_affinity))

Furthermore, the output of nvidia-smi topo -m looks very unusual on that system. Do you know whether you're getting just a partition of the node, or whether you should have the full node with exclusive access for your allocation? Could you also post the output of cat /proc/cpuinfo from that node?

@AquifersBSIM
Author

Hello @pentschev, regarding the "os.sched_setaffinity", I had the line commented out.

Regarding the question about whether I'm getting just a partition of the node or the full node with exclusive access for my allocation: I am sure I am only getting a partition of the node.

Information from print_affinity2.py

[0]
[32768]
[0]
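
If these are 64-bit mask words (which the math.ceil(cpu_count() / 64) argument suggests), they can be decoded into core indices with a small sketch like this; [32768] is 1 << 15, i.e. only core 15, and [0] means no cores at all:

def decode_affinity(words):
    # Each entry is a 64-bit word; a set bit at (word_index * 64 + bit) means that core is allowed.
    cores = []
    for word_idx, word in enumerate(words):
        for bit in range(64):
            if word & (1 << bit):
                cores.append(word_idx * 64 + bit)
    return cores

print(decode_affinity([32768]))  # -> [15]
print(decode_affinity([0]))      # -> []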

Information from cat /proc/cpuinfo
The output is very lengthy, so if it's alright, here is a snippet of it:

processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 25
model           : 1
model name      : AMD EPYC 7543 32-Core Processor
stepping        : 1
microcode       : 0xa0011d5
cpu MHz         : 3662.940
cache size      : 512 KB
physical id     : 0
siblings        : 32
core id         : 1
cpu cores       : 32
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 16
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca debug_swap
bugs            : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass srso
bogomips        : 5589.37
TLB size        : 2560 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]

@pentschev
Member

pentschev commented Oct 3, 2024

So if you're getting only a partition of the node, does that mean you don't have access to all the CPU cores as well? That could be why properly determining the CPU affinity fails, and to be honest I have no experience with that sort of partitioning and don't know whether NVML even supports it. If you know the details, could you provide more information about the CPU setup, e.g., how many physical CPUs (i.e., sockets) there are, how many cores you actually see in /proc/cpuinfo, and anything else that would help us understand the topology of the system/cluster?
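
If it helps, a quick standard-library check (a sketch; it just reports what the OS exposes to your job) that you could run inside the same allocation:

import os

# Total CPUs the OS reports for the node
print("os.cpu_count():", os.cpu_count())

# CPUs this process is actually allowed to run on (what Slurm/cgroups hand to the job)
print("sched_getaffinity:", sorted(os.sched_getaffinity(0)))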
