sporadic fatal error messages due to critical bug in madvise() hook with OpenIB #4509


Closed
bingmann opened this issue Nov 16, 2017 · 14 comments

@bingmann

I believe there is a critical bug in the (new?) madvise() hook:

In a program that does lots of Isend()/Irecv() with Wait()/Test(), I sporadically see something like the following fatal error message:

--------------------------------------------------------------------------
Open MPI intercepted a call to free memory that is still being used by
an ongoing MPI communication.  This usually reflects an error in the
MPI application; it may signify memory corruption.  Open MPI will now
abort your job.

  rcache name:    grdma
  Local host:     fh1n076
  Buffer address: 0x2af75c4fc000
  Buffer size:    2568192
--------------------------------------------------------------------------

The error message is bogus and suspicious: the program never allocates a buffer of that size, nor does it ever use that address as a pointer; the address merely lies inside one of the memory areas the program uses.

Open MPI version: 3.0.0, installed from the source tarball with debug output enabled. The bug probably affects all versions to which the madvise() commits were backported.

Running on a Linux HPC cluster: kernel 3.10.0-693.2.2.el7.x86_64, InfiniBand 4X FDR interconnect, glibc 2.17, gcc 5.2.0.

The error only occurs with the openib BTL; with TCP it apparently never occurs, because the grdma rcache module is not used.

I believe the bug affects all programs that use asynchronous communication, openib, and varying buffer sizes. It occurs naturally after running the program for some time. I have attached a test program that triggers the error artificially.

Backtrace and Autopsy

Lots of debugging leads me to believe there is a bug in the way the interception of madvise() clears memory from the grdma rcache, which deregisters RDMA memory regions.

The fatal error occurs when _intercept_madvise() is called, which calls opal_mem_hooks_release_hook(), which in turn calls mca_rcache_base_mem_cb(), which contains the fatal error message.

mca_rcache_base_mem_cb() is supposed to free rcache allocations; it prints the message when mca_rcache_grdma_invalidate_range() fails.
The deallocation works by iterating over the memory area tree with mca_rcache_base_vma_iterate() and calling gc_add() for each area to invalidate.

gc_add() fails if the area to invalidate still has a non-zero reference count.

The issue is that madvise() is called in my program by libc's malloc implementation with MADV_DONTNEED, to release regions that are no longer needed. This happens at unpredictable times, probably when malloc decides to consolidate free space in the heap.

The fatal error occurs after the following sequence of operations:

  • Allocate and send a (large) buffer. This registers the memory area in grdma.
  • Free the buffer. After completion of the send, the registration remains cached.
  • Allocate a smaller buffer. By chance, malloc() reuses the same memory address for the smaller allocation.
  • Perform an MPI_Isend() on the smaller buffer. This raises the reference count of the cached larger memory registration.
  • malloc() decides to consolidate free heap memory and calls madvise() on the second part of the memory area.

This triggers the fatal error, because the cached registration of the large memory area is still marked as in use (a minimal sketch of this sequence follows below).
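
For concreteness, here is a minimal sketch of that sequence in C (illustration only, not the attached test program; sizes are arbitrary, and it skips the malloc()-address-reuse step by simply reusing the front of the same buffer, the way the attached program triggers the error artificially):

/* Hypothetical sketch: advise away pages that overlap a cached,
 * still-referenced RDMA registration, as libc malloc may do on its own.
 * Run with 2 ranks over the openib BTL. */
#include <mpi.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int peer = 1 - rank;

    long page = sysconf(_SC_PAGESIZE);
    int large = 4 << 20, small = 1 << 20;   /* arbitrary, page-multiple sizes */
    char *buf = NULL, *rbuf = NULL;
    posix_memalign((void **)&buf, (size_t)page, (size_t)large);
    posix_memalign((void **)&rbuf, (size_t)page, (size_t)large);
    MPI_Request reqs[2];

    /* 1. Exchange the full buffer: openib registers it, grdma caches the registration. */
    MPI_Irecv(rbuf, large, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(buf,  large, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    /* 2. Reuse only the front part: this raises the reference count of the
     *    cached (large) registration. */
    MPI_Irecv(rbuf, small, MPI_CHAR, peer, 1, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(buf,  small, MPI_CHAR, peer, 1, MPI_COMM_WORLD, &reqs[1]);

    /* 3. Emulate libc consolidating free heap space: advise away the tail.
     *    The range overlaps the still-referenced registration -> fatal error. */
    madvise(buf + small, (size_t)(large - small), MADV_DONTNEED);

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    free(buf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}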

The fundamental bug, I believe, is that mca_rcache_base_vma_iterate() returns all memory areas overlapping the queried range (I have not verified this). Hence, _intercept_madvise() attempts to free every area that merely overlaps the range in question.

I believe the right behaviour would be to free only areas fully covered by the madvise() call (see the sketch below). While this would leave some areas unfreed, the current behaviour leads to random fatal aborts. Disabling the _intercept_madvise() hook is a temporary work-around.
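
To make the distinction explicit, here is a hypothetical sketch of the two predicates (illustrative names only, not Open MPI's actual rcache code):

/* reg_base/reg_bound describe a cached registration [reg_base, reg_bound),
 * madv_base/madv_len the range passed to madvise(). */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* What apparently happens today: any overlap triggers invalidation. */
static bool overlaps(uintptr_t reg_base, uintptr_t reg_bound,
                     uintptr_t madv_base, size_t madv_len)
{
    return reg_base < madv_base + madv_len && madv_base < reg_bound;
}

/* Proposed: only invalidate registrations fully inside the advised range. */
static bool fully_covered(uintptr_t reg_base, uintptr_t reg_bound,
                          uintptr_t madv_base, size_t madv_len)
{
    return madv_base <= reg_base && reg_bound <= madv_base + madv_len;
}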

Can someone confirm this bug and maybe its solution?
I (currently) do not have enough experience with the Open MPI codebase to write a patch myself.

I have attached a program which triggers the error by artificially calling madvise(); in my real application this is done sporadically from inside libc. The error only occurs when using openib over a real InfiniBand network; it does not occur when running with shared memory or TCP.

test_madvise.cpp.txt

@jsquyres
Member

@hjelmn Can you have a look?

(sorry for the delay -- this report came in during the Supercomputing trade show, and this week is the US Thanksgiving holiday)

@hppritcha
Member

@hjelmn can you look at this?

@hjelmn
Member

hjelmn commented Dec 6, 2017

You should not disable the hook; that will lead to incorrect program behavior. The problem is that we treat every invalidation of a region as an error if any part of the region is in use. If part of a region is unmapped (or hit by madvise(MADV_DONTNEED)), the entire region should be invalidated, and we need to suppress the error in that case. I am working on a fix now. It won't be perfect, but we are not required to print an error for users who free in-use memory; I would much rather accept a false negative than produce a false positive. If I can't find a good solution I will remove the error message entirely.
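
In rough terms, the policy described here amounts to the following (hypothetical sketch with illustrative names, not Open MPI's actual rcache structures):

/* Called for every cached registration that overlaps an unmapped/advised
 * range [base, base + size). The registration is always invalidated; an
 * error is reported only when the range starts exactly at the registration
 * base while the registration is still referenced. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct cached_reg {
    uintptr_t base, bound;   /* [base, bound) of the cached registration */
    int       ref_count;     /* > 0 while an MPI request still uses it */
};

static bool invalidate_region(struct cached_reg *reg, uintptr_t base, size_t size)
{
    (void)size;

    /* Always invalidate: the kernel may remap any overlapping pages. */
    /* ... mark reg invalid / queue it for deregistration ... */

    /* Complain only when the user apparently freed the very buffer that is
     * still in use, i.e. the unmapped range starts at the registration base. */
    return reg->ref_count > 0 && base == reg->base;
}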

hjelmn added a commit to hjelmn/ompi that referenced this issue Dec 6, 2017
It is possible to have parts of an in-use registered region be passed
to munmap or madvise. This does not necessarily mean the user has made
an error but does mean the entire region should be invalidated. This
commit checks that the munmap or madvise base matches the beginning of
the cached region. If it does and the region is in-use then we print
an error. There will certainly be false-negatives where a user
unmaps something that really is in-use but that is preferable to a
false-positive.

References open-mpi#4509

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
@jsquyres
Member

jsquyres commented Dec 6, 2017

See #4576.

@hjelmn
Member

hjelmn commented Dec 11, 2017

@bingmann Have you tried running with #4576? Would like to know if it fixes the issue for you.

hjelmn added a commit that referenced this issue Dec 12, 2017
hjelmn added a commit to hjelmn/ompi that referenced this issue Dec 12, 2017
@jsquyres
Member

@bingmann We just merged #4576 into master, and therefore the fix will be included in the nightly snapshot tarball tonight (https://www.open-mpi.org/nightly/master/). Can you try a master nightly snapshot after tonight and let us know if it worked for you?

hjelmn added a commit to hjelmn/ompi that referenced this issue Dec 12, 2017
hjelmn added a commit to hjelmn/ompi that referenced this issue Dec 12, 2017
@markalle
Contributor

Is it known whether the madvise() hook is needed? Going back to the origins on our side of the fence, I was the one who guessed that madvise() with MADV_DONTNEED/MADV_REMOVE might change the virtual-to-physical mapping and should therefore trigger pin-cache invalidation. But I only did that out of an abundance of caution, without any real evidence behind it. Later, when I wrote test programs explicitly trying to hit issues from madvise(), I couldn't get anything to fail. So I'm leaning toward saying it's an unnecessary interception.

Of course unnecessary cache invalidations should still work, so I'm hoping the #4576 you linked to does resolve this.

@bingmann
Author

Hmm. I finally got around to compiling the git version. My minimal test program test_madvise.cpp.txt from above still crashes due to madvise().

It now segfaults on my small InfiniBand test cluster:
[i10pc151:27627] *** Process received signal ***
[i10pc151:27627] Signal: Segmentation fault (11)
[i10pc151:27627] Signal code: Address not mapped (1)
[i10pc151:27627] Failing at address: 0xa0
[i10pc151:27627] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x36cb0)[0x7fa13fe74cb0]
[i10pc151:27627] [ 1] /home/bingmann/ompi-debug/lib/openmpi/mca_rcache_grdma.so(+0x1e01)[0x7fa13424fe01]
[i10pc151:27627] [ 2] /home/bingmann/ompi-debug/lib/libopen-pal.so.0(mca_rcache_base_module_destroy+0x8f)[0x7fa13f941c6f]
[i10pc151:27627] [ 3] /home/bingmann/ompi-debug/lib/openmpi/mca_btl_openib.so(+0xe8ef)[0x7fa12f3ae8ef]
[i10pc151:27627] [ 4] /home/bingmann/ompi-debug/lib/openmpi/mca_btl_openib.so(mca_btl_openib_finalize+0x530)[0x7fa12f3a96b0]
[i10pc151:27627] [ 5] /home/bingmann/ompi-debug/lib/libopen-pal.so.0(+0x74dd0)[0x7fa13f8eedd0]
[i10pc151:27627] [ 6] /home/bingmann/ompi-debug/lib/libopen-pal.so.0(mca_base_framework_close+0x79)[0x7fa13f8d7239]
[i10pc151:27627] [ 7] /home/bingmann/ompi-debug/lib/libopen-pal.so.0(mca_base_framework_close+0x79)[0x7fa13f8d7239]
[i10pc151:27627] [ 8] /home/bingmann/ompi-debug/lib/libmpi.so.0(ompi_mpi_finalize+0x831)[0x7fa1402524e1]
[i10pc151:27627] [ 9] ./test[0x400b23]
[i10pc151:27627] [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7fa13fe5ff45]
[i10pc151:27627] [11] ./test[0x4008d9]
[i10pc151:27627] *** End of error message ***

If I remove the madvise() line, the test runs to completion.

@hjelmn
Member

hjelmn commented Dec 21, 2017

The hook is needed because the kernel does indeed change the mappings on MADV_DONTNEED.
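
For illustration (a minimal standalone sketch, not from this thread): after madvise(MADV_DONTNEED) on an anonymous private mapping, the next access is served by a fresh zero-filled page, so any RDMA registration pinned to the old physical page is stale.

/* Demonstrates that MADV_DONTNEED drops the physical backing of anonymous
 * memory: the page reads back as zeros afterwards. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 4096;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    memset(p, 0xAB, len);            /* populate the page */
    madvise(p, len, MADV_DONTNEED);  /* kernel discards the physical page */

    /* The next read faults in a new zero page: the mapping has changed. */
    printf("first byte after MADV_DONTNEED: 0x%02x\n", (unsigned char)p[0]);

    munmap(p, len);
    return 0;
}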

@hjelmn
Member

hjelmn commented Dec 21, 2017

@bingmann Can you look at the core with gdb and get a backtrace with line numbers? I will see if I can reproduce the issue, but this error seems different from the one reported before.

@bingmann
Author

Got the backtrace from a core dump:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f4c35637462 in dereg_mem (reg=0x11e8f00) at rcache_grdma_module.c:136
136	        mca_rcache_base_vma_delete (rcache_grdma->cache->vma_module, reg);
(gdb) bt
#0  0x00007f4c35637462 in dereg_mem (reg=0x11e8f00) at rcache_grdma_module.c:136
#1  0x00007f4c35637546 in do_unregistration_gc (rcache=0x1039b10) at rcache_grdma_module.c:159
#2  0x00007f4c35638121 in mca_rcache_grdma_finalize (rcache=0x1039b10) at rcache_grdma_module.c:534
#3  0x00007f4c4106ccc3 in mca_rcache_base_module_destroy (module=0x1039b10) at base/rcache_base_create.c:113
#4  0x00007f4c349b9e00 in device_destruct (device=0x102d9b0) at btl_openib_component.c:963
#5  0x00007f4c349ae696 in opal_obj_run_destructors (object=0x102d9b0) at ../../../../opal/class/opal_object.h:462
#6  0x00007f4c349b4700 in mca_btl_openib_finalize_resources (btl=0x103bec0) at btl_openib.c:1714
#7  0x00007f4c349b4806 in mca_btl_openib_finalize (btl=0x103bec0) at btl_openib.c:1741
#8  0x00007f4c4100d258 in mca_btl_base_close () at base/btl_base_frame.c:202
#9  0x00007f4c40ff0068 in mca_base_framework_close (framework=0x7f4c412bde40 <opal_btl_base_framework>) at mca_base_framework.c:216
#10 0x00007f4c41a96853 in mca_bml_base_close () at base/bml_base_frame.c:130
#11 0x00007f4c40ff0068 in mca_base_framework_close (framework=0x7f4c41d4ba80 <ompi_bml_base_framework>) at mca_base_framework.c:216
#12 0x00007f4c41a11713 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:453
#13 0x00007f4c41a4f383 in PMPI_Finalize () at pfinalize.c:47
#14 0x0000000000400b23 in main (argc=1, argv=0x7ffdab49ee88) at test_madvise.cpp:44

Happy Holidays.

@hjelmn
Member

hjelmn commented Dec 22, 2017

Found the issue. Fixing the erroneous error message invalidated some internal assumptions. Fix incoming.

hjelmn added a commit to hjelmn/ompi that referenced this issue Dec 22, 2017
This commit fixes an issue when a registration is created for a large
region and then invalidated while part of it is in use.

References open-mpi#4509

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
hjelmn added a commit that referenced this issue Dec 22, 2017
hjelmn added a commit to hjelmn/ompi that referenced this issue Dec 22, 2017
hjelmn added a commit to hjelmn/ompi that referenced this issue Dec 22, 2017
hjelmn added a commit to hjelmn/ompi that referenced this issue Dec 22, 2017
@bwbarrett
Member

@hjelmn, can this be closed now?

@hjelmn
Member

hjelmn commented Jan 16, 2018

Yup.

@hjelmn hjelmn closed this as completed Jan 16, 2018