DASK Deployment using SLURM with GPUs #1381

Open
AquifersBSIM opened this issue Sep 6, 2024 · 8 comments

@AquifersBSIM
Describe the issue:
I am running into an issue deploying Dask with LocalCUDACluster() on an HPC. I am trying to run RandomForest, and the amount of data I am inputting exceeds the memory of a single GPU, so I am trying to use several GPUs to split the dataset. To start with, I ran the following example script (taken from the Dask GitHub front page); a rough sketch of the RandomForest use case I am aiming for follows it.

Minimal Complete Verifiable Example:

import glob

def main():

    # Read CSV file in parallel across workers
    import dask_cudf
    df = dask_cudf.read_csv(glob.glob("*.csv"))

    # Fit a NearestNeighbors model and query it
    from cuml.dask.neighbors import NearestNeighbors
    nn = NearestNeighbors(n_neighbors = 10, client=client)
    nn.fit(df)
    neighbors = nn.kneighbors(df)

if __name__ == "__main__":

    # Initialize UCX for high-speed transport of CUDA arrays
    from dask_cuda import LocalCUDACluster

    # Create a Dask single-node CUDA cluster w/ one worker per device
    cluster = LocalCUDACluster()
    
    from dask.distributed import Client
    client = Client(cluster)

    main()
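
For context, the multi-GPU RandomForest I am ultimately aiming for would look roughly like this (a sketch only, assuming cuml's Dask RandomForestClassifier; the "label" column name, file pattern, and hyperparameters are placeholders):

import glob

import dask_cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from cuml.dask.ensemble import RandomForestClassifier

if __name__ == "__main__":
    # One worker per GPU visible on this node
    cluster = LocalCUDACluster()
    client = Client(cluster)

    # Partition the CSV data across the GPU workers
    df = dask_cudf.read_csv(glob.glob("*.csv"))
    X = df.drop(columns=["label"]).astype("float32")
    y = df["label"].astype("int32")

    # Training is distributed across all GPU workers
    rf = RandomForestClassifier(n_estimators=100)
    rf.fit(X, y)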

In addition to that, I have this submission script:

#!/bin/bash
#
#SBATCH --job-name=dask_examples
#SBATCH --output=dask_examples.txt
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --ntasks-per-node=1
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=5G
#SBATCH --gres=gpu:4
ml conda
conda activate /fred/oz241/BSIM/conda_SVM/SVM
/usr/bin/time -v python 1.py
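
For completeness, a quick check (a sketch using pynvml, which is already installed in the environment) that could be added before starting the cluster to confirm how many GPUs the job actually sees:

import pynvml

# Count the GPUs visible to this Slurm allocation
pynvml.nvmlInit()
print("visible GPUs:", pynvml.nvmlDeviceGetCount())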

Error Message

Task exception was never retrieved
future: <Task finished name='Task-543' coro=<_wrap_awaitable() done, defined at /fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/deploy/spec.py:124> exception=RuntimeError('Worker failed to start.')>
Traceback (most recent call last):
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 523, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/utils.py", line 1952, in wait_for
    return await fut
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/worker.py", line 1474, in start_unsafe
    raise plugins_exceptions[0]
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/utils.py", line 837, in wrapper
    return await func(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/worker.py", line 1876, in plugin_add
    result = plugin.setup(worker=self)
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/dask_cuda/plugins.py", line 14, in setup
    os.sched_setaffinity(0, self.cores)
    ^^^^^^^^^^^^^^^^^
OSError: [Errno 22] Invalid argument

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/deploy/spec.py", line 125, in _wrap_awaitable
    return await aw
           ^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 512, in start
    raise self.__startup_exc
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 523, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/utils.py", line 1952, in wait_for
    return await fut
           ^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 369, in start_unsafe
    response = await self.instantiate()
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 452, in instantiate
    result = await self.process.start()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 759, in start
    msg = await self._wait_until_connected(uid)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 901, in _wait_until_connected
    raise msg["exception"]
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 965, in run
    async with worker:
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 537, in __aenter__
    await self
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 531, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
    ^^^^^^^^^^^^^^^^^
RuntimeError: Worker failed to start.
Task exception was never retrieved
future: <Task finished name='Task-541' coro=<_wrap_awaitable() done, defined at /fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/deploy/spec.py:124> exception=RuntimeError('Worker failed to start.')>
(The same traceback as above repeats for this worker.)

Anything else we need to know?:
The traceback was pretty long; I have only included a snippet of it.

Environment:

  • Dask version: 2024.7.1
  • dask-jobqueue: 0.9.0
  • Python version: 3.11.9
  • Operating System: Linux (Slurm HPC)
  • Install method (conda, pip, source): conda
@pentschev
Member

Could you please report the output of nvidia-smi topo -m and the output of the script below? Please make sure both are run on a Slurm node where you experienced the original failure reported above.

print_affinity.py
import pynvml
from dask_cuda.utils import get_cpu_affinity


pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    cpu_affinity = get_cpu_affinity(i)
    print(type(cpu_affinity), cpu_affinity)

@AquifersBSIM
Author

Hi @pentschev, I forgot to mention that I have disabled the os.sched_setaffinity(0, self.cores) call, as shown below:

# Edited dask_cuda/plugins.py with the affinity call disabled
from distributed import WorkerPlugin


class CPUAffinity(WorkerPlugin):
    def __init__(self, cores):
        self.cores = cores

    def setup(self, worker=None):
        pass
        # os.sched_setaffinity(0, self.cores)

@pentschev
Member

Keep in mind doing that will likely result in degraded performance. Here's a previous comment I wrote about this on a similar issue.
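
For reference, what that plugin normally does on worker startup is pin the worker process to the CPU cores closest to its GPU, roughly along these lines (a sketch; the core list is illustrative, dask-cuda derives the real one from the GPU topology via NVML):

import os

# Pin the current process (pid 0 = self) to a set of CPU cores.
# If the set is empty, or contains cores the job is not allowed to use,
# this raises OSError: [Errno 22] Invalid argument, matching the failure above.
os.sched_setaffinity(0, {0, 1, 2, 3})
print(sorted(os.sched_getaffinity(0)))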

@AquifersBSIM
Author

AquifersBSIM commented Sep 11, 2024

Thank you @pentschev for the reply about me disabling os.sched_setaffinity. I will probably need some time to report the output of nvidia-smi topo -m.

Regarding print_affinity.py:
Do I have to re-enable os.sched_setaffinity for it to work?

@AquifersBSIM
Author

AquifersBSIM commented Sep 12, 2024

Hi @pentschev, here are the outputs of nvidia-smi topo -m and print_affinity.py. For your information, I have not re-enabled os.sched_setaffinity.

nvidia-smi topo -m output

        GPU0    GPU1    GPU2    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV4     NV4     SYS             3               N/A
GPU1    NV4      X      NV4     SYS             1               N/A
GPU2    NV4     NV4      X      NODE            5               N/A
NIC0    SYS     SYS     NODE     X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0

print_affinity.py output

<class 'list'> []
<class 'list'> []
<class 'list'> []

@pentschev
Member

@AquifersBSIM can you clarify what you mean by "I have not enabled the os.sched_setaffinity"? Do you mean that when you ran the above you had the line commented out as in your previous #1381 (comment)? If so, that doesn't really matter for the experiment above.

In any case, that unfortunately didn't really clarify whether the failure was in obtaining the CPU affinity or whether something else happened. Could you please run the following modified version of the script on the compute node?

print_affinity2.py
import math
from multiprocessing import cpu_count

import pynvml


pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    cpu_affinity = pynvml.nvmlDeviceGetCpuAffinity(handle, math.ceil(cpu_count() / 64))
    print(list(cpu_affinity))

Furthermore, the output of nvidia-smi topo -m looks very unusual on that system. Do you know whether you're getting just a partition of the node, or whether you should have the full node with exclusive access for your allocation? Could you also post the output of cat /proc/cpuinfo from that node?

@AquifersBSIM
Author

Hello @pentschev, regarding the "os.sched_setaffinity", I had the line commented out.

Regarding the question about whether I'm getting just a partition of the node or the full node with exclusive access for my allocation: I am sure I am only getting a partition of the node.

Information from print_affinity2.py

[0]
[32768]
[0]
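
If these are 64-bit mask words (which the math.ceil(cpu_count() / 64) argument suggests), they can be decoded into core indices with a small sketch like this; [32768] is 1 << 15, i.e. only core 15, and [0] means no cores at all:

def decode_affinity(words):
    # Each entry is a 64-bit word; a set bit at (word_index * 64 + bit) means that core is allowed.
    cores = []
    for word_idx, word in enumerate(words):
        for bit in range(64):
            if word & (1 << bit):
                cores.append(word_idx * 64 + bit)
    return cores

print(decode_affinity([32768]))  # -> [15]
print(decode_affinity([0]))      # -> []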

Information from cat /proc/cpuinfo
The output is very lengthy, so if it's alright, here is a snippet of it:

processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 25
model           : 1
model name      : AMD EPYC 7543 32-Core Processor
stepping        : 1
microcode       : 0xa0011d5
cpu MHz         : 3662.940
cache size      : 512 KB
physical id     : 0
siblings        : 32
core id         : 1
cpu cores       : 32
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 16
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca debug_swap
bugs            : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass srso
bogomips        : 5589.37
TLB size        : 2560 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]

@pentschev
Member

pentschev commented Oct 3, 2024

So if you're getting only a partition of the node, does that mean you don't have access to all the CPU cores as well? That could be why properly determining the CPU affinity fails, and to be honest I have no experience with that sort of partitioning and don't know whether NVML even supports it. If you know the details, could you provide more information about the CPU setup, e.g., how many physical CPUs (i.e., sockets) there are, how many cores you actually see in /proc/cpuinfo, and anything else that would help us understand the topology of the system/cluster?
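
If it helps, a quick standard-library check (a sketch; it just reports what the OS exposes to your job) that you could run inside the same allocation:

import os

# Total CPUs the OS reports for the node
print("os.cpu_count():", os.cpu_count())

# CPUs this process is actually allowed to run on (what Slurm/cgroups hand to the job)
print("sched_getaffinity:", sorted(os.sched_getaffinity(0)))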
