UVM buffers failing in cuIpcGetMemHandle ? #6799
It seems to fail inside libpsm2, which might not handle "managed" CUDA allocations correctly and tries to use IPC on them (which it should not). Any idea who to assign this to for PSM2 support?
Hi, thanks. In the meantime I tracked this all the way back through PSM2 to a bug in CUDA (I recompiled PSM2 from source and inserted printf debugging; had a lovely 24h...). To recap: using Unified Virtual Memory (cudaMallocManaged) still leads to errors:
These arise when using multiple GPUs per node with UVM. Investigating the source: a bug in cuPointerGetAttribute causes PSM2 to interpret UVM pointers as device pointers and try to use CUDA IPC to communicate. A now-simple piece of code, extracted from PSM2, FAILED in Grid with:
Via
———————
CUDA is supposed to return one of:
But it incorrectly reports unified memory as device memory, causing PSM2 to do bad things.
This causes the test in PSM2 to interpret UVM pointers as CU_MEMORYTYPE_DEVICE and try and ——————
——————
This returns an error, but still does not produce CU_MEMORYTYPE_UNIFIED, so this is a bug in the current version of CUDA. However, the absence of cudaSuccess will probably make PSM2 work. Except that on our system we get: "CUDA driver version is insufficient for CUDA runtime version". Will report this to CUDA. Even the CUDA 10 version does not return CU_MEMORYTYPE_UNIFIED and returns an invalid argument error!
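The misclassification described in this comment can be illustrated with a small sketch. It is not PSM2 code: `mem_type`, `naive_can_use_cuda_ipc` and `can_use_cuda_ipc` are hypothetical names, and the enum is a stand-in that mirrors the `CUmemorytype` values from `cuda.h` so the example compiles without the CUDA toolkit. It also assumes, which this thread does not confirm, that a caller additionally queries the driver's `CU_POINTER_ATTRIBUTE_IS_MANAGED` attribute and treats managed buffers as ineligible for CUDA IPC.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins mirroring the CUmemorytype values in cuda.h (assumption: real
 * code would use the cuda.h enum and fill these in via cuPointerGetAttribute). */
enum mem_type {
    MEM_HOST    = 1, /* CU_MEMORYTYPE_HOST    */
    MEM_DEVICE  = 2, /* CU_MEMORYTYPE_DEVICE  */
    MEM_ARRAY   = 3, /* CU_MEMORYTYPE_ARRAY   */
    MEM_UNIFIED = 4  /* CU_MEMORYTYPE_UNIFIED */
};

/* Check based on the reported memory type only. A managed (UVM) buffer that
 * is reported as device memory is wrongly sent down the IPC path. */
bool naive_can_use_cuda_ipc(enum mem_type type)
{
    return type == MEM_DEVICE;
}

/* Hypothetical stricter check: a managed buffer is never eligible for IPC,
 * whatever memory type was reported for it. */
bool can_use_cuda_ipc(enum mem_type type, bool is_managed)
{
    if (is_managed)
        return false;
    return type == MEM_DEVICE;
}
```

With a managed buffer reported as device memory, `naive_can_use_cuda_ipc(MEM_DEVICE)` selects IPC, whereas `can_use_cuda_ipc(MEM_DEVICE, true)` rejects it and would leave the buffer to a non-IPC path.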
Closing.
The CUDA 10.1 execution of the code was run on a system with the CUDA 9.2 kernel driver. It is possible (Tim Lanfear produced the same output with CUDA 9.2 and 10.1) that the 10.1 behaviour will match 9.2 once the kernel driver is updated. This is on a centrally run supercomputer, so I can't update kernel drivers to check.
I have filed:
Possibly related to:
Worth reopening. The bug is not in Open MPI, but it prevents use of Open MPI on important architectures.
This isn't going to be addressed in the v4.0.x release stream, so removing the v4.0.x label.
Reviewing old issues - it appears that Adam submitted a pair of patches to PSM2 for this back in 2019. @paboyle - I know it's been a ridiculously long time, but do you know if this was fixed in more recent IFS releases, or is this still a problem for you?
We recently audited the PSM GitHub, and it looks like they implemented/accepted a patch similar to what I proposed, but didn't come back and update the issue I filed.
I'd be happy to close this issue?
I suspect that's exactly what happened. Thanks for the fast reply and for closing the issue.
Background information
I'm running Open MPI 4.0.1, self-compiled, over Omni-Path with IFS 10.8, as distributed by Intel.
The boards are
The good news is that MPI appears to work between nodes, where these buffers are sent from explicit device memory.
However when I run four MPI ranks per node and ensure that communications between ranks use unified virtual memory (UVM) allocated with cudaMallocManaged(), I get a failure:
When I run with a patch to the code to use explicit host memory the code succeeds.
However, I want to be able to run these buffers from UVM and have loops with either host or device execution policy fill them, as that is how the code was designed to operate.
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
v4.0.1
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
./configure CC=gcc CXX=g++ --prefix=/home/dp008/dp008/paboyle/Modules/openmpi/install/ --with-psm2-libdir=/lib64/ --with-cuda=/tessfs1/sw/cuda/9.2/ --enable-orterun-prefix-by-default
Compiled with gcc set to 7.3.0
Please describe the system on which you are running
Redhat Centos 7.4
HPE XA780i
Dual Skylake 4116, 12+12 core.
Two OPA dual-port HFIs.
Four V100 SXM2.
96GB RAM.
Details of the problem
When I run four MPI ranks per node and ensure that communications between ranks use unified virtual memory (UVM) allocated with cudaMallocManaged(), I get a failure:
When I run with a patch to the code to use explicit host memory the code succeeds.
However, I want to be able to run these buffers from UVM and have loops with either host or device execution policy fill them, as that is how the code was designed to operate.
Running the unmodified code with one rank per node works, so the UVM is working as a source for network traffic, but not as a source for intra-node traffic between GPUs.
Is there something I need to configure differently? (I admit this is a complex environment, so I could be missing something!)