
UVM buffers failing in cuIpcGetMemHandle? #6799

Closed
paboyle opened this issue Jul 8, 2019 · 12 comments
paboyle commented Jul 8, 2019

Background information

I'm running Open MPI 4.0.1, self-compiled, over Omni-Path with IFS 10.8 as distributed by Intel.

The boards are

  • HPE XA with
  • 4 x Nvidia Volta V100 GPUs and
  • 4 OPA 100Gb ports on two PCIe dual-port HFI cards.

The good news is that MPI appears to work between nodes, where these buffers are sent from explicit device memory.

However when I run four MPI ranks per node and ensure that communications between ranks use unified virtual memory (UVM) allocated with cudaMallocManaged(), I get a failure:

r6i6n7.218497 Benchmark_dwf: CUDA failure: cuIpcGetMemHandle() (at /nfs/site/home/phcvs2/gitrepo/ifs-all/Ofed_Delta/rpmbuild/BUILD/libpsm2-11.2.23/ptl_am/am_reqrep_shmem.c:1977)returned 1 
r6i6n7.218497 Error returned from CUDA function.

When I run with a patch that uses explicit host memory instead, the code succeeds.
However, I want to run these buffers from UVM and have loops with either host or device execution policy fill them, as that is how the code was designed to operate.

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

v4.0.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

./configure CC=gcc CXX=g++ --prefix=/home/dp008/dp008/paboyle/Modules/openmpi/install/ --with-psm2-libdir=/lib64/ --with-cuda=/tessfs1/sw/cuda/9.2/ --enable-orterun-prefix-by-default

Compiled with gcc 7.3.0.

Please describe the system on which you are running

  • Operating system/version:

Red Hat CentOS 7.4

  • Computer hardware:

HPE XA780i
Dual Skylake 4116, 12+12 cores.
Two OPA dual-port HFIs.
Four V100 SXM2.
96 GB RAM.

  • Network type:

Two OPA dual-port HFIs.


Details of the problem


Running the unmodified code with one rank per node works, so the UVM is working as a source for network traffic, but not as a source for intra-node traffic between GPUs.

Is there something I need to configure differently? (I admit this is a complex environment, so I could be missing something!)

sjeaugey (Member) commented Jul 8, 2019

It seems to fail inside libpsm2, which might not handle "managed" CUDA allocations correctly and tries to use IPC on them (which it should not).

Any idea who to assign this to for PSM2 support?

paboyle (Author) commented Jul 8, 2019

Hi, thanks. In the meantime I have tracked this all the way back through PSM2 to a bug in CUDA (I recompiled PSM2 from source and inserted printf debugging; a lovely 24 hours...).

Using Unified Virtual Memory (cudaMallocManaged) still leads to errors:

r6i6n7.80511 Benchmark_dwf: CUDA failure: cuIpcGetMemHandle() (at /nfs/site/home/phcvs2/gitrepo/ifs-all/Ofed_Delta/rpmbuild/BUILD/libpsm2-11.2.23/ptl_am/am_reqrep_shmem.c:1977) returned 1
r6i6n7.80511 Error returned from CUDA function.

This arises when using multiple GPUs per node with UVM. Investigating the source:

https://github.com/intel/opa-psm2/blob/816c0dbdf911dba097dcbb09f023c5113713c33e/ptl_am/am_reqrep_shmem.c#L1973

A bug in cuPointerGetAttribute causes PSM2 to interpret UVM pointers as device pointers and to try to use CUDA IPC to communicate intra-node, with subsequent failure.
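If the memory-type attribute alone cannot be trusted, a defensive check could also query CU_POINTER_ATTRIBUTE_IS_MANAGED, which identifies UVM allocations even when the memory type misreports DEVICE. A minimal sketch of that idea (illustrative only, not PSM2's actual code; the helper name is hypothetical, and it needs a CUDA-capable system to run):

```c
/* Sketch of a managed-memory-aware pointer check (hypothetical helper,
 * not PSM2's actual code). Build (assumed): nvcc check.c -o check -lcuda */
#include <cuda.h>
#include <cuda_runtime_api.h>
#include <stdio.h>

/* Return 1 only for memory that should be safe to export via CUDA IPC:
 * CU_MEMORYTYPE_DEVICE and NOT managed (UVM). */
static int is_ipc_eligible(const void *ptr)
{
  CUmemorytype mt = (CUmemorytype)0;
  unsigned int is_managed = 0;
  if (cuPointerGetAttribute(&mt, CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
                            (CUdeviceptr)ptr) != CUDA_SUCCESS)
    return 0; /* pointer unknown to the driver: treat as host memory */
  if (cuPointerGetAttribute(&is_managed, CU_POINTER_ATTRIBUTE_IS_MANAGED,
                            (CUdeviceptr)ptr) != CUDA_SUCCESS)
    return 0;
  return (mt == CU_MEMORYTYPE_DEVICE) && !is_managed;
}

int main(void)
{
  void *ptr = NULL;
  if (cudaMallocManaged(&ptr, 1 << 20, cudaMemAttachGlobal) != cudaSuccess)
    return 1;
  /* For a cudaMallocManaged() buffer this should print 0: even when
   * MEMORY_TYPE misreports DEVICE, IS_MANAGED still flags it as UVM. */
  printf("ipc eligible: %d\n", is_ipc_eligible(ptr));
  cudaFree(ptr);
  return 0;
}
```

CU_POINTER_ATTRIBUTE_IS_MANAGED has been available since CUDA 8, so a check of this shape would have been usable by the libpsm2 version in question.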

A simple piece of code, extracted from PSM2, failed in Grid with:

      auto err = cudaMallocManaged((void **)&ptr,bytes);
      assert(err == cudaSuccess );
      CUmemorytype mt;
      CUresult perr = cuPointerGetAttribute( &mt, CU_POINTER_ATTRIBUTE_MEMORY_TYPE, (CUdeviceptr) ptr);
      assert(perr == CUDA_SUCCESS );
      printf("alignedAllocator %lx %d\n",(uint64_t )ptr, mt);fflush(stdout);
      assert (mt == CU_MEMORYTYPE_UNIFIED);

Producing:

      alignedAllocator 7ffdc4000000 2
      Benchmark_dwf: /tessfs1/home/dp008/dp008/paboyle/GPU-unified/Grid/Grid/allocator/AlignedAllocator.h:186: _Tp* Grid::alignedAllocator<_Tp>::allocate(Grid::alignedAllocator<_Tp>::size_type, const void*) [with _Tp = Grid::iScalar<Grid::iVector<Grid::iVector<Grid::Grid_simd<thrust::complex<float>, Grid::GpuVector<4, Grid::GpuComplex<float2> > >, 3>, 4> >; Grid::alignedAllocator<_Tp>::pointer = Grid::iScalar<Grid::iVector<Grid::iVector<Grid::Grid_simd<thrust::complex<float>, Grid::GpuVector<4, Grid::GpuComplex<float2> > >, 3>, 4> >*; Grid::alignedAllocator<_Tp>::size_type = long unsigned int]: Assertion `(mt == CU_MEMORYTYPE_UNIFIED)' failed.

———————
An even simpler, 16-line example that fails:
———————

#include <cuda.h>
#include <cuda_runtime_api.h>
#include <cassert>
#include <stdio.h>
int main(int argc, char**argv)
{
 unsigned long bytes = 1024*1024;
 void *ptr;
 auto err = cudaMallocManaged((void **)&ptr,bytes);
 assert(err == cudaSuccess );
 CUmemorytype mt;
 CUresult perr = cuPointerGetAttribute( &mt, CU_POINTER_ATTRIBUTE_MEMORY_TYPE, (CUdeviceptr) ptr);
 assert(perr == CUDA_SUCCESS );
 printf("alignedAllocator %lx %d\n",(uint64_t )ptr, mt);fflush(stdout);
 assert (mt == CU_MEMORYTYPE_UNIFIED);
}

cuPointerGetAttribute is supposed to return one of:

CU_MEMORYTYPE_HOST = 0x01 Host memory 
CU_MEMORYTYPE_DEVICE = 0x02 Device memory 
CU_MEMORYTYPE_ARRAY = 0x03 Array memory 
CU_MEMORYTYPE_UNIFIED = 0x04 Unified device or host memory

But it incorrectly reports unified memory as device memory, causing PSM2 to do bad things, like believe it can use CUDA IPC.
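The downstream failure can be reproduced directly: calling cuIpcGetMemHandle() on a cudaMallocManaged() buffer, as PSM2 effectively does after the misidentification, is rejected by the driver. A minimal sketch (illustrative only; file and binary names are hypothetical, and it needs a CUDA-capable system):

```c
/* Demonstrates why PSM2's assumption fails: cuIpcGetMemHandle() rejects
 * managed (UVM) allocations. Build (assumed): nvcc ipc_demo.c -o ipc_demo -lcuda */
#include <cuda.h>
#include <cuda_runtime_api.h>
#include <stdio.h>

int main(void)
{
  void *managed = NULL;
  CUipcMemHandle handle;
  if (cudaMallocManaged(&managed, 1 << 20, cudaMemAttachGlobal) != cudaSuccess)
    return 1;
  /* PSM2, believing this is plain device memory, effectively does this: */
  CUresult rc = cuIpcGetMemHandle(&handle, (CUdeviceptr)managed);
  /* On the systems above this returned 1 (CUDA_ERROR_INVALID_VALUE), the
   * same "returned 1" seen in the am_reqrep_shmem.c failure. */
  printf("cuIpcGetMemHandle returned %d\n", (int)rc);
  cudaFree(managed);
  return 0;
}
```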

——————
CUDA 9.2
——————
nvcc simple.cc -o simple.x -lcuda

[paboyle@r6i6n6 ~]$ ./simple.x 
alignedAllocator 7fff88000000 2
simple.x: simple.cc:17: int main(int, char**): Assertion `mt == CU_MEMORYTYPE_UNIFIED' failed.
Aborted

This causes the test in PSM2 to interpret UVM pointers as CU_MEMORYTYPE_DEVICE and to try, and fail, to use CUDA IPC on UVM.

——————
CUDA 9.1 (officially supported with IFS 10.8)
——————

./simple.x 
alignedAllocator 7fff88000000 2
simple.x: simple.cc:17: int main(int, char**): Assertion `mt == CU_MEMORYTYPE_UNIFIED' failed.
Aborted

——————
CUDA 10.1
——————

[paboyle@r6i6n6 ~]$ ./simple.x 
simple.x: simple.cc:12: int main(int, char**): Assertion `err == cudaSuccess' failed.
Aborted

Returns an error, but still does not produce CU_MEMORYTYPE_UNIFIED => a bug in the current version of CUDA.

However, the absence of cudaSuccess will probably make PSM2 work. Except that on our system,
PSM2 in IFS 10.8 does not support CUDA 10.1, failing with:

CUDA driver version is insufficient for CUDA runtime version

I will report this to CUDA. Even the CUDA 10 version does not return CU_MEMORYTYPE_UNIFIED; it returns an invalid-argument error instead!

paboyle (Author) commented Jul 9, 2019

Closing

paboyle (Author) commented Jul 10, 2019

The CUDA 10.1 run of the code was on a system with the CUDA 9.2 kernel driver. It is possible (Tim Lanfear produced the same output with CUDA 9.2 and 10.1) that the 10.1 behaviour will match 9.2 once the kernel driver is updated.

This is on a centrally run supercomputer, so I can't update kernel drivers to check.

paboyle (Author) commented Jul 11, 2019

I have filed:

cornelisnetworks/opa-psm2#41

paboyle (Author) commented Jul 11, 2019

Possibly related to:

#4899

paboyle reopened this Jul 12, 2019
paboyle (Author) commented Jul 12, 2019

Worth reopening. The bug is not in Open MPI, but it prevents use of Open MPI on important architectures.
Help applying pressure to resolve what I think is a PSM2/CUDA incompatibility would be welcome.

hppritcha (Member) commented

This isn't going to be addressed in the v4.0.x release stream, so removing the v4.0.x label.

mwheinz commented Apr 13, 2021

Reviewing old issues - it appears that Adam submitted a pair of patches to PSM2 for this back in 2019.

@paboyle - I know it's been a ridiculously long time but do you know if this was fixed in more recent IFS releases or is this still a problem for you?

paboyle (Author) commented Apr 13, 2021

We recently audited the PSM GitHub, and it looks like they implemented/accepted a similar patch to what I proposed, but didn't come back and update the issue I filed.

paboyle (Author) commented Apr 13, 2021

I'd be happy to close this issue?

paboyle closed this as completed Apr 13, 2021
mwheinz commented Apr 13, 2021

> We recently audited the PSM GitHub, and it looks like they implemented/accepted a similar patch to what I proposed, but didn't come back and update the issue I filed.

I suspect that's exactly what happened. Thanks for the fast reply and for closing the issue.
