DASK Deployment using SLURM with GPUs #8857

AquifersBSIM · 2024-09-04T09:13:24Z

Describe the issue:
I am running into an issue deploying Dask with LocalCUDACluster() on an HPC system. I am trying to train a RandomForest, and the amount of input data exceeds the limit of a single GPU, so I am trying to use several GPUs to split the dataset. As a first step, I ran the following example script (adapted from the Dask GitHub front page):

Minimal Complete Verifiable Example:

import glob

def main(client):

    # Read CSV files in parallel across workers
    import dask_cudf
    df = dask_cudf.read_csv(glob.glob("*.csv"))

    # Fit a NearestNeighbors model and query it
    from cuml.dask.neighbors import NearestNeighbors
    nn = NearestNeighbors(n_neighbors=10, client=client)
    nn.fit(df)
    neighbors = nn.kneighbors(df)

if __name__ == "__main__":

    # Import the single-node CUDA cluster helper
    # (no UCX options are passed here, so the default transport is used)
    from dask_cuda import LocalCUDACluster

    # Create a Dask single-node CUDA cluster w/ one worker per device
    cluster = LocalCUDACluster()

    from dask.distributed import Client
    client = Client(cluster)

    # Pass the client explicitly instead of relying on a module-level global
    main(client)

In addition to that, I submit the job with this SLURM script:

#!/bin/bash
#
#SBATCH --job-name=dask_examples
#SBATCH --output=dask_examples.txt
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --ntasks-per-node=1
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=5G
#SBATCH --gres=gpu:4
ml conda
conda activate /fred/oz241/BSIM/conda_SVM/SVM
/usr/bin/time -v python 1.py
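
As a quick sanity check before the cluster is created, the job can print what SLURM actually granted. The sketch below is an illustration, not part of the original report. It assumes the site's SLURM configuration exports `SLURM_CPUS_PER_TASK` and sets `CUDA_VISIBLE_DEVICES` for `--gres=gpu:4`, which is common but not guaranteed. The helper name `parse_visible_devices` is hypothetical.

```python
import os


def parse_visible_devices(value):
    """Split a CUDA_VISIBLE_DEVICES-style string into a list of device IDs."""
    if not value:
        return []
    return [item.strip() for item in value.split(",") if item.strip()]


if __name__ == "__main__":
    # Assumption: SLURM exports these variables inside the job; sites may differ.
    gpus = parse_visible_devices(os.environ.get("CUDA_VISIBLE_DEVICES"))
    cpus = os.environ.get("SLURM_CPUS_PER_TASK", "unset")
    print(f"GPUs visible to the job: {gpus or 'none reported'}")
    print(f"CPUs per task requested: {cpus}")
```

Running this at the top of `1.py`, or as a separate step in the submission script, shows whether the four requested GPUs and eight CPUs are visible to the Python process.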

Error Message

Task exception was never retrieved
future: <Task finished name='Task-543' coro=<_wrap_awaitable() done, defined at /fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/deploy/spec.py:124> exception=RuntimeError('Worker failed to start.')>
Traceback (most recent call last):
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 523, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/utils.py", line 1952, in wait_for
    return await fut
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/worker.py", line 1474, in start_unsafe
    raise plugins_exceptions[0]
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/utils.py", line 837, in wrapper
    return await func(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/worker.py", line 1876, in plugin_add
    result = plugin.setup(worker=self)
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/dask_cuda/plugins.py", line 14, in setup
    os.sched_setaffinity(0, self.cores)
    ^^^^^^^^^^^^^^^^^
OSError: [Errno 22] Invalid argument

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/deploy/spec.py", line 125, in _wrap_awaitable
    return await aw
           ^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 512, in start
    raise self.__startup_exc
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 523, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/utils.py", line 1952, in wait_for
    return await fut
           ^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 369, in start_unsafe
    response = await self.instantiate()
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 452, in instantiate
    result = await self.process.start()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 759, in start
    msg = await self._wait_until_connected(uid)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 901, in _wait_until_connected
    raise msg["exception"]
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 965, in run
    async with worker:
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 537, in __aenter__
    await self
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 531, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
    ^^^^^^^^^^^^^^^^^
RuntimeError: Worker failed to start.
Task exception was never retrieved
future: <Task finished name='Task-541' coro=<_wrap_awaitable() done, defined at /fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/deploy/spec.py:124> exception=RuntimeError('Worker failed to start.')>
Traceback (most recent call last):
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 523, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/utils.py", line 1952, in wait_for
    return await fut
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/worker.py", line 1474, in start_unsafe
    raise plugins_exceptions[0]
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/utils.py", line 837, in wrapper
    return await func(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/worker.py", line 1876, in plugin_add
    result = plugin.setup(worker=self)
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/dask_cuda/plugins.py", line 14, in setup
    os.sched_setaffinity(0, self.cores)
    ^^^^^^^^^^^^^^^^^
OSError: [Errno 22] Invalid argument

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/deploy/spec.py", line 125, in _wrap_awaitable
    return await aw
           ^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 512, in start
    raise self.__startup_exc
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 523, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/utils.py", line 1952, in wait_for
    return await fut
           ^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 369, in start_unsafe
    response = await self.instantiate()
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 452, in instantiate
    result = await self.process.start()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 759, in start
    msg = await self._wait_until_connected(uid)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 901, in _wait_until_connected
    raise msg["exception"]
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 965, in run
    async with worker:
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 537, in __aenter__
    await self
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 531, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
    ^^^^^^^^^^^^^^^^^
RuntimeError: Worker failed to start.

Anything else we need to know?:
The traceback was quite long, so I have included only a snippet of it.
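
The first exception in the traceback comes from `os.sched_setaffinity(0, self.cores)`. On Linux, that call fails with `EINVAL` (errno 22) when none of the requested cores is among those the process is permitted to use, for example because the scheduler confines the job to a subset of the node's CPUs. Whether that is what happened here is an assumption, not something the report confirms. A minimal sketch for checking it from inside the job follows; `usable_cores`, `current_allowed_cores`, and the example core IDs are hypothetical.

```python
import os


def usable_cores(requested, allowed):
    """Return the requested cores that the process is actually allowed to use."""
    return set(requested) & set(allowed)


def current_allowed_cores():
    """Return the cores this process may run on (Linux), else all known cores."""
    if hasattr(os, "sched_getaffinity"):
        return set(os.sched_getaffinity(0))
    # Fallback for platforms without sched_getaffinity.
    return set(range(os.cpu_count() or 1))


if __name__ == "__main__":
    allowed = current_allowed_cores()
    print("Cores this job may use:", sorted(allowed))

    # Hypothetical core IDs standing in for the set a worker tries to pin to.
    requested = {16, 17, 18, 19}
    overlap = usable_cores(requested, allowed)
    if overlap:
        print("Pinning could succeed on:", sorted(overlap))
    else:
        print("No overlap: sched_setaffinity would raise OSError (EINVAL).")
```

Run inside the same `sbatch` job, this prints the CPU set the job is confined to, which can then be compared with the cores a worker attempts to pin itself to.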

Environment:

Dask version: 2024.7.1
dask-jobqueue: 0.9.0
Python version: 3.11.9
Operating System: Linux (Slurm HPC)
Install method (conda, pip, source): conda


jacobtomlinson · 2024-09-04T13:58:42Z

Given that this is related to LocalCUDACluster I would recommend opening this issue on https://github.com/rapidsai/dask-cuda instead. Unfortunately GitHub doesn't allow me to transfer this between orgs.

AquifersBSIM · 2024-09-06T02:37:50Z

Thank you so much, @jacobtomlinson, for pointing me to the right place.

jrbourbeau · 2024-09-06T16:55:10Z

Thanks @AquifersBSIM @jacobtomlinson. Closing this in favor of rapidsai/dask-cuda#1381

github-actions bot added the needs triage label Sep 4, 2024

jrbourbeau closed this as completed Sep 6, 2024
