DASK Deployment using SLURM with GPUs #1381
Could you please report the output of `print_affinity.py`?

```python
# print_affinity.py
import pynvml
from dask_cuda.utils import get_cpu_affinity

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    cpu_affinity = get_cpu_affinity(i)
    print(type(cpu_affinity), cpu_affinity)
```
Hi @pentschev, I have forgotten to mention that I have disabled the `os.sched_setaffinity(0, self.cores)` call, as attached below.
Keep in mind doing that will likely result in degraded performance. Here's a previous comment I wrote about this on a similar issue.
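For context, here is a minimal sketch of the kind of CPU pinning that line performs, assuming a Linux host (`get_cpu_affinity` and `os.sched_setaffinity` are the real APIs; the surrounding scaffolding is illustrative, not dask-cuda's actual worker code):

```python
# Illustrative sketch: pin the current process to the CPUs closest to GPU 0,
# which is roughly what the disabled os.sched_setaffinity call does per worker.
import os

import pynvml
from dask_cuda.utils import get_cpu_affinity

pynvml.nvmlInit()
cores = get_cpu_affinity(0)  # CPU ids with good NUMA locality to GPU 0
if cores:
    os.sched_setaffinity(0, cores)  # the call that was commented out above
print("Now pinned to CPUs:", sorted(os.sched_getaffinity(0)))
```

Disabling the pinning leaves worker threads free to run on CPUs far from the GPU's NUMA node, which is why performance can degrade.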
Thank you @pentschev for the reply about me disabling `os.sched_setaffinity`. I probably need some time to report the output of `print_affinity.py`.
Hi @pentschev, here are the reports:

- `nvidia-smi topo -m` output
- `print_affinity.py` output
@AquifersBSIM can you clarify what you mean by "I have not enabled the os.sched_setaffinity"? Do you mean that when you ran the above you had the line commented out, as in your previous #1381 (comment)? If so, that doesn't really matter for the experiment above.

In any case, that unfortunately didn't clarify whether the failure was in obtaining the CPU affinity or whether something else happened. Would you please run the following modified version of the script, `print_affinity2.py`, on the compute node?

Furthermore, the output of …
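The body of `print_affinity2.py` was not captured above. A plausible reconstruction of such a diagnostic, which isolates each step so a failure can be attributed to NVML or to the dask-cuda helper, might look like this (the pynvml and dask_cuda calls are real APIs; the structure is an assumption, not the actual script from the thread):

```python
# Hypothetical step-by-step affinity diagnostic; not the actual print_affinity2.py.
import math
import os

import pynvml
from dask_cuda.utils import get_cpu_affinity

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    # Step 1: query the raw NVML affinity bitmask directly.
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        cpu_set_size = math.ceil((os.cpu_count() or 64) / 64)  # 64 CPUs per uint64 word
        raw = pynvml.nvmlDeviceGetCpuAffinity(handle, cpu_set_size)
        print(f"GPU {i} raw NVML affinity words: {[hex(w) for w in raw]}")
    except pynvml.NVMLError as e:
        print(f"GPU {i}: NVML affinity query failed: {e}")
        continue

    # Step 2: the dask_cuda helper that converts the bitmask to CPU ids.
    try:
        print(f"GPU {i} get_cpu_affinity: {get_cpu_affinity(i)}")
    except Exception as e:
        print(f"GPU {i}: get_cpu_affinity failed: {e}")
```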
Hello @pentschev, regarding `os.sched_setaffinity`, I had the line commented out. Regarding the question "do you know if you're getting just a partition of the node or if you should have the full node with exclusive access for your allocation?": I am sure I am just getting a partition of the node.

Information from …

Information from …
So if you're getting only a partition of the node, does that mean you don't have access to all the CPU cores as well? That could be the reason why properly determining the CPU affinity fails. TBH, I have no experience with that sort of partitioning, and I don't know whether NVML supports it either. If you know the details, can you provide more information about the CPU status, e.g., how many physical CPUs (i.e., sockets) there are, and how many cores you actually see with …
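One way to check this from inside the allocation is to compare the CPUs the OS reports with the CPUs the process is actually allowed to use; a minimal sketch, assuming a Linux compute node:

```python
# Compare all CPUs the node exposes with the CPUs this process may run on.
import os

total = os.cpu_count()             # every CPU the node exposes
allowed = os.sched_getaffinity(0)  # CPUs this process is allowed to run on
print(f"os.cpu_count():         {total}")
print(f"len(sched_getaffinity): {len(allowed)}")
print(f"allowed CPU ids:        {sorted(allowed)}")
# Under a SLURM cgroup/partition, 'allowed' is often a strict subset of 'total',
# which can make a GPU's NVML affinity point at CPUs outside the allocation.
```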
Describe the issue:
I am running into an issue with deploying Dask using `LocalCUDACluster()` on an HPC system. I am trying to train a RandomForest, and the amount of data I am inputting exceeds the memory limit of a single GPU, so I am trying to utilize several GPUs and split the dataset across them. To start with, I used the example script from the Dask GitHub front page, shown below:
Minimal Complete Verifiable Example:
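The script itself was not captured above. A minimal sketch of the kind of example described, combining `LocalCUDACluster` with the blocked-array demo from the Dask front page, could look like this (the exact script the reporter ran is an assumption):

```python
# Hypothetical stand-in for the reporter's example script.
import dask.array as da
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":  # required: LocalCUDACluster spawns worker processes
    cluster = LocalCUDACluster()  # one worker per visible GPU
    client = Client(cluster)

    # Blocked array computation from the Dask front-page example.
    x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
    y = x + x.T
    z = y[::2, 5_000:].mean(axis=1)
    print(z.compute())

    client.close()
    cluster.close()
```

Note that `LocalCUDACluster()` only discovers the GPUs visible to the job, so under SLURM it sees just what the allocation exposes.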
In addition to that, I have this submission script:
Error Message:
Anything else we need to know?:
The traceback was pretty long; I have included only a snippet of it.
Environment: