sporadic fatal error messages due to critical bug in madvise() hook with OpenIB #4509
Comments
@hjelmn Can you have a look? (sorry for the delay -- this report came in during the Supercomputing trade show, and this week is the US Thanksgiving holiday)
@hjelmn can you look at this?
You should not disable the hook. That will lead to incorrect program behavior. The problem is that we treat all cases of invalidation of a region as an error if any part of the region is in use. If part of a region is unmapped (or passed to madvise with MADV_DONTNEED) then the entire region should be invalidated. We need to suppress the error in this case. I am working on a fix now. It won't be perfect but we are not required to provide an error to users who free in-use memory. I would much rather have no false positives than have a false negative. If I can't find a good solution I will remove the error message entirely.
It is possible to have parts of an in-use registered region be passed to munmap or madvise. This does not necessarily mean the user has made an error but does mean the entire region should be invalidated. This commit checks that the munmap or madvise base matches the beginning of the cached region. If it does and the region is in-use then we print an error. There will certainly be false-negatives where a user unmaps something that really is in-use but that is preferable to a false-positive. References open-mpi#4509 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
See #4576.
It is possible to have parts of an in-use registered region be passed to munmap or madvise. This does not necessarily mean the user has made an error but does mean the entire region should be invalidated. This commit checks that the munmap or madvise base matches the beginning of the cached region. If it does and the region is in-use then we print an error. There will certainly be false-negatives where a user unmaps something that really is in-use but that is preferable to a false-positive. References open-mpi#4509 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit d3fa1bb) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
@bingmann We just merged #4576 into master, and therefore the fix will be included in the nightly snapshot tarball tonight (https://www.open-mpi.org/nightly/master/). Can you try a master nightly snapshot after tonight and let us know if it worked for you?
Is it known if the madvise() hook is needed? Going back to the origins on our side of the fence I was the one who guessed that madvise() with MADV_DONTNEED/MADV_REMOVE might result in the virtual-to-physical mapping changing and thus that it should trigger pin cache invalidation. But I only did that out of an abundance of caution and without any real evidence behind it. Later when I tried making test programs explicitly trying to hit issues from madvise() I couldn't get anything to fail. So I'm leaning toward saying it's an unnecessary interception. Of course unnecessary cache invalidations should still work, so I'm hoping the #4576 you linked to does resolve this.
Hmm. I finally got around to compiling from git.
The minimal test program segfaults on my small InfiniBand test cluster:
If I remove the madvise() line, the test runs through.
The hook is needed because the kernel does indeed change the mappings on MADV_DONTNEED.
@bingmann Can you look at the core with gdb and get a backtrace with line numbers? I will see if I can reproduce the issue but this error seems different from the one from before.
Got the backtrace from a core dump:
Happy Holidays.
Found the issue. I fixed the erroneous error message but invalidated some internal assumptions. Fix incoming.
This commit fixes an issue when a registration is created for a large region and then invalidated while part of it is in use. References open-mpi#4509 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes an issue when a registration is created for a large region and then invalidated while part of it is in use. References open-mpi#4509 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit 39d5988) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
@hjelmn, can this be closed now?
Yup.
I believe there is a critical bug in the (new?) madvise() hook:
In a program that does lots of Isend()/Irecv() with Wait()/Test(), I sporadically see something like the following fatal error message:
The error message is bogus and suspicious, especially since the program never allocates a buffer of that size. Nor does it use that address as a pointer, although the address lies inside a memory area the program uses.
OpenMPI Version: 3.0.0, installed from the source tarball with debug output. Probably affects all versions with the madvise() commits backported.
Running on a Linux HPC cluster, Kernel 3.10.0-693.2.2.el7.x86_64, with an InfiniBand 4X FDR Interconnect, glibc 2.17, gcc 5.2.0.
The error only occurs with the openib BTL; with TCP it apparently never occurs, because the grdma rcache module is not used.
I believe the bug affects all programs using asynchronous communication, openib, and varying buffer sizes. It occurs naturally after running the program for some time. I have added a test program triggering the error artificially.
Backtrace and Autopsy
Lots of debugging leads me to believe there is a bug in the way the interception of madvise() clears memory from the rcache grdma, which frees RDMA memory regions.
The fatal error message occurs when _intercept_madvise() is called, which calls opal_mem_hooks_release_hook(), which calls mca_rcache_base_mem_cb(), which contains the fatal error message.
mca_rcache_base_mem_cb() is supposed to free rcache allocations and prints the message when mca_rcache_grdma_invalidate_range() fails.
The deallocation of areas happens by iterating over the memory area tree using mca_rcache_base_vma_iterate(), and calling gc_add() for areas to invalidate. gc_add() fails if the invalidated area still has reference counts.
The issue is that madvise() is called in my program by the libc's malloc implementation with MADV_DONTNEED to free up regions no longer needed. This occurs at unpredictable times, probably when malloc decides to consolidate free space in the heap.
The fatal error occurs after the following sequence of operations:
- malloc() reuses the same memory address for the smaller allocation.
- MPI_Isend() on the smaller buffer. This raises the reference count of the cached larger memory registration.
- malloc() decides to consolidate free heap memory, calling madvise() on the second part of our memory area.
This triggers the fatal error, because the cached registration of the large memory area is still marked as used.
The fundamental bug, I believe, is that mca_rcache_base_vma_iterate() returns all memory areas overlapping (? did not check) the queried area. Hence, _intercept_madvise() attempts to free all areas that overlap the area in question.
I believe the right behaviour would be to only free areas fully covered by the madvise() call. While this would lead to some areas not being freed, the current state leads to random fatal aborts. Disabling the _intercept_madvise() hook is a temporary work-around.
Can someone confirm this bug and maybe its solution?
I also (currently) do not have enough experience with the OpenMPI codebase to write a patch.
I have attached a program which triggers the error by artificially calling madvise(). In my real application this is done sporadically from inside the libc. The error only occurs when using OpenIB over a real InfiniBand network; it does not occur when running with shared memory or TCP.
test_madvise.cpp.txt