sporadic fatal error messages due to critical bug in madvise() hook with OpenIB #4509


Closed
bingmann opened this issue Nov 16, 2017 · 14 comments

@bingmann

I believe there is a critical bug in the (new?) madvise() hook:

In a program that does lots of Isend()/Irecv() with Wait()/Test(), I sporadically see something like the following fatal error message:

--------------------------------------------------------------------------
Open MPI intercepted a call to free memory that is still being used by
an ongoing MPI communication.  This usually reflects an error in the
MPI application; it may signify memory corruption.  Open MPI will now
abort your job.

  rcache name:    grdma
  Local host:     fh1n076
  Buffer address: 0x2af75c4fc000
  Buffer size:    2568192
--------------------------------------------------------------------------

The error message is bogus and suspicious: the program never allocates a buffer of that size, nor does it ever use that address as a pointer; the address merely lies inside one of the memory areas the program uses.

Open MPI version: 3.0.0, installed from the source tarball with debug output enabled. The bug probably affects all versions to which the madvise() commits were backported.

Running on a Linux HPC cluster: kernel 3.10.0-693.2.2.el7.x86_64, InfiniBand 4X FDR interconnect, glibc 2.17, gcc 5.2.0.

The error only occurs with the openib BTL; with TCP it apparently never occurs, because the grdma rcache module is not used.

I believe the bug affects all programs that use asynchronous communication, openib, and varying buffer sizes. It occurs naturally after running the program for some time. I have attached a test program that triggers the error artificially.

Backtrace and Autopsy

Lots of debugging leads me to believe there is a bug in the way the interception of madvise() clears memory from the grdma rcache, which deregisters RDMA memory regions.

The fatal error occurs when _intercept_madvise() is called, which calls opal_mem_hooks_release_hook(), which in turn calls mca_rcache_base_mem_cb(), which contains the fatal error message.

mca_rcache_base_mem_cb() is supposed to free rcache allocations; it prints the message when mca_rcache_grdma_invalidate_range() fails.
The deallocation works by iterating over the memory area tree with mca_rcache_base_vma_iterate() and calling gc_add() for each area to invalidate.

gc_add() fails if the area to invalidate still has a non-zero reference count.

The issue is that madvise() is called in my program by libc's malloc implementation with MADV_DONTNEED, to release regions that are no longer needed. This happens at unpredictable times, probably when malloc decides to consolidate free space in the heap.

The fatal error occurs after the following sequence of operations:

  • Allocate and send a (large) buffer. This registers the memory area in grdma.
  • Free the buffer. After completion of the send, the registration remains cached.
  • Allocate a smaller buffer. By chance, malloc() reuses the same memory address for the smaller allocation.
  • Perform an MPI_Isend() on the smaller buffer. This raises the reference count of the cached larger memory registration.
  • malloc() decides to consolidate free heap memory and calls madvise() on the second part of the memory area.

This triggers the fatal error, because the cached registration of the large memory area is still marked as in use (a minimal sketch of this sequence follows below).
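
For concreteness, here is a minimal sketch of that sequence in C (illustration only, not the attached test program; sizes are arbitrary, and it skips the malloc()-address-reuse step by simply reusing the front of the same buffer, the way the attached program triggers the error artificially):

/* Hypothetical sketch: advise away pages that overlap a cached,
 * still-referenced RDMA registration, as libc malloc may do on its own.
 * Run with 2 ranks over the openib BTL. */
#include <mpi.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int peer = 1 - rank;

    long page = sysconf(_SC_PAGESIZE);
    int large = 4 << 20, small = 1 << 20;   /* arbitrary, page-multiple sizes */
    char *buf = NULL, *rbuf = NULL;
    posix_memalign((void **)&buf, (size_t)page, (size_t)large);
    posix_memalign((void **)&rbuf, (size_t)page, (size_t)large);
    MPI_Request reqs[2];

    /* 1. Exchange the full buffer: openib registers it, grdma caches the registration. */
    MPI_Irecv(rbuf, large, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(buf,  large, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    /* 2. Reuse only the front part: this raises the reference count of the
     *    cached (large) registration. */
    MPI_Irecv(rbuf, small, MPI_CHAR, peer, 1, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(buf,  small, MPI_CHAR, peer, 1, MPI_COMM_WORLD, &reqs[1]);

    /* 3. Emulate libc consolidating free heap space: advise away the tail.
     *    The range overlaps the still-referenced registration -> fatal error. */
    madvise(buf + small, (size_t)(large - small), MADV_DONTNEED);

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    free(buf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}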

The fundamental bug, I believe, is that mca_rcache_base_vma_iterate() returns all memory areas overlapping the queried range (I have not verified this). Hence, _intercept_madvise() attempts to free every area that merely overlaps the range in question.

I believe the right behaviour would be to free only areas fully covered by the madvise() call (see the sketch below). While this would leave some areas unfreed, the current behaviour leads to random fatal aborts. Disabling the _intercept_madvise() hook is a temporary work-around.
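
To make the distinction explicit, here is a hypothetical sketch of the two predicates (illustrative names only, not Open MPI's actual rcache code):

/* reg_base/reg_bound describe a cached registration [reg_base, reg_bound),
 * madv_base/madv_len the range passed to madvise(). */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* What apparently happens today: any overlap triggers invalidation. */
static bool overlaps(uintptr_t reg_base, uintptr_t reg_bound,
                     uintptr_t madv_base, size_t madv_len)
{
    return reg_base < madv_base + madv_len && madv_base < reg_bound;
}

/* Proposed: only invalidate registrations fully inside the advised range. */
static bool fully_covered(uintptr_t reg_base, uintptr_t reg_bound,
                          uintptr_t madv_base, size_t madv_len)
{
    return madv_base <= reg_base && reg_bound <= madv_base + madv_len;
}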

Can someone confirm this bug and maybe its solution?
I (currently) do not have enough experience with the Open MPI codebase to write a patch myself.

I have attached a program which triggers the error by artificially calling madvise(); in my real application this is done sporadically from inside libc. The error only occurs when using openib over a real InfiniBand network; it does not occur when running with shared memory or TCP.

test_madvise.cpp.txt

@jsquyres
Member

@hjelmn Can you have a look?

(sorry for the delay -- this report came in during the Supercomputing trade show, and this week is the US Thanksgiving holiday)

@hppritcha
Member

@hjelmn can you look at this?

@hjelmn
Member

hjelmn commented Dec 6, 2017

You should not disable the hook; that will lead to incorrect program behavior. The problem is that we treat every invalidation of a region as an error if any part of the region is in use. If part of a region is unmapped (or hit by madvise(MADV_DONTNEED)), the entire region should be invalidated, and we need to suppress the error in that case. I am working on a fix now. It won't be perfect, but we are not required to print an error for users who free in-use memory; I would much rather accept a false negative than produce a false positive. If I can't find a good solution I will remove the error message entirely.
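
In rough terms, the policy described here amounts to the following (hypothetical sketch with illustrative names, not Open MPI's actual rcache structures):

/* Called for every cached registration that overlaps an unmapped/advised
 * range [base, base + size). The registration is always invalidated; an
 * error is reported only when the range starts exactly at the registration
 * base while the registration is still referenced. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct cached_reg {
    uintptr_t base, bound;   /* [base, bound) of the cached registration */
    int       ref_count;     /* > 0 while an MPI request still uses it */
};

static bool invalidate_region(struct cached_reg *reg, uintptr_t base, size_t size)
{
    (void)size;

    /* Always invalidate: the kernel may remap any overlapping pages. */
    /* ... mark reg invalid / queue it for deregistration ... */

    /* Complain only when the user apparently freed the very buffer that is
     * still in use, i.e. the unmapped range starts at the registration base. */
    return reg->ref_count > 0 && base == reg->base;
}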

hjelmn added a commit to hjelmn/ompi that referenced this issue Dec 6, 2017
It is possible to have parts of an in-use registered region be passed
to munmap or madvise. This does not necessarily mean the user has made
an error but does mean the entire region should be invalidated. This
commit checks that the munmap or madvise base matches the beginning of
the cached region. If it does and the region is in-use then we print
an error. There will certainly be false-negatives where a user
unmaps something that really is in-use but that is preferable to a
false-positive.

References open-mpi#4509

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
@jsquyres
Member

jsquyres commented Dec 6, 2017

See #4576.

@hjelmn
Member

hjelmn commented Dec 11, 2017

@bingmann Have you tried running with #4576? Would like to know if it fixes the issue for you.

hjelmn added a commit that referenced this issue Dec 12, 2017
hjelmn added a commit to hjelmn/ompi that referenced this issue Dec 12, 2017
@jsquyres
Member

@bingmann We just merged #4576 into master, and therefore the fix will be included in the nightly snapshot tarball tonight (https://www.open-mpi.org/nightly/master/). Can you try a master nightly snapshot after tonight and let us know if it worked for you?

hjelmn added a commit to hjelmn/ompi that referenced this issue Dec 12, 2017
hjelmn added a commit to hjelmn/ompi that referenced this issue Dec 12, 2017
@markalle
Contributor

Is it known whether the madvise() hook is needed? Going back to the origins on our side of the fence, I was the one who guessed that madvise() with MADV_DONTNEED/MADV_REMOVE might change the virtual-to-physical mapping and should therefore trigger pin-cache invalidation. But I only did that out of an abundance of caution, without any real evidence behind it. Later, when I wrote test programs explicitly trying to hit issues from madvise(), I couldn't get anything to fail. So I'm leaning toward saying it's an unnecessary interception.

Of course unnecessary cache invalidations should still work, so I'm hoping the #4576 you linked to does resolve this.

@bingmann
Author

Hmm. I finally got around to compiling the git version. My minimal test program test_madvise.cpp.txt from above still crashes due to madvise().

It now segfaults on my small InfiniBand test cluster:
[i10pc151:27627] *** Process received signal ***
[i10pc151:27627] Signal: Segmentation fault (11)
[i10pc151:27627] Signal code: Address not mapped (1)
[i10pc151:27627] Failing at address: 0xa0
[i10pc151:27627] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x36cb0)[0x7fa13fe74cb0]
[i10pc151:27627] [ 1] /home/bingmann/ompi-debug/lib/openmpi/mca_rcache_grdma.so(+0x1e01)[0x7fa13424fe01]
[i10pc151:27627] [ 2] /home/bingmann/ompi-debug/lib/libopen-pal.so.0(mca_rcache_base_module_destroy+0x8f)[0x7fa13f941c6f]
[i10pc151:27627] [ 3] /home/bingmann/ompi-debug/lib/openmpi/mca_btl_openib.so(+0xe8ef)[0x7fa12f3ae8ef]
[i10pc151:27627] [ 4] /home/bingmann/ompi-debug/lib/openmpi/mca_btl_openib.so(mca_btl_openib_finalize+0x530)[0x7fa12f3a96b0]
[i10pc151:27627] [ 5] /home/bingmann/ompi-debug/lib/libopen-pal.so.0(+0x74dd0)[0x7fa13f8eedd0]
[i10pc151:27627] [ 6] /home/bingmann/ompi-debug/lib/libopen-pal.so.0(mca_base_framework_close+0x79)[0x7fa13f8d7239]
[i10pc151:27627] [ 7] /home/bingmann/ompi-debug/lib/libopen-pal.so.0(mca_base_framework_close+0x79)[0x7fa13f8d7239]
[i10pc151:27627] [ 8] /home/bingmann/ompi-debug/lib/libmpi.so.0(ompi_mpi_finalize+0x831)[0x7fa1402524e1]
[i10pc151:27627] [ 9] ./test[0x400b23]
[i10pc151:27627] [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7fa13fe5ff45]
[i10pc151:27627] [11] ./test[0x4008d9]
[i10pc151:27627] *** End of error message ***

If I remove the madvise() line, the test runs to completion.

@hjelmn
Member

hjelmn commented Dec 21, 2017

The hook is needed because the kernel does indeed change the mappings on MADV_DONTNEED.
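
For illustration (a minimal standalone sketch, not from this thread): after madvise(MADV_DONTNEED) on an anonymous private mapping, the next access is served by a fresh zero-filled page, so any RDMA registration pinned to the old physical page is stale.

/* Demonstrates that MADV_DONTNEED drops the physical backing of anonymous
 * memory: the page reads back as zeros afterwards. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 4096;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    memset(p, 0xAB, len);            /* populate the page */
    madvise(p, len, MADV_DONTNEED);  /* kernel discards the physical page */

    /* The next read faults in a new zero page: the mapping has changed. */
    printf("first byte after MADV_DONTNEED: 0x%02x\n", (unsigned char)p[0]);

    munmap(p, len);
    return 0;
}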

@hjelmn
Member

hjelmn commented Dec 21, 2017

@bingmann Can you look at the core with gdb and get a backtrace with line numbers? I will see if I can reproduce the issue, but this error seems different from the one reported before.

@bingmann
Author

Got the backtrace from a core dump:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f4c35637462 in dereg_mem (reg=0x11e8f00) at rcache_grdma_module.c:136
136	        mca_rcache_base_vma_delete (rcache_grdma->cache->vma_module, reg);
(gdb) bt
#0  0x00007f4c35637462 in dereg_mem (reg=0x11e8f00) at rcache_grdma_module.c:136
#1  0x00007f4c35637546 in do_unregistration_gc (rcache=0x1039b10) at rcache_grdma_module.c:159
#2  0x00007f4c35638121 in mca_rcache_grdma_finalize (rcache=0x1039b10) at rcache_grdma_module.c:534
#3  0x00007f4c4106ccc3 in mca_rcache_base_module_destroy (module=0x1039b10) at base/rcache_base_create.c:113
#4  0x00007f4c349b9e00 in device_destruct (device=0x102d9b0) at btl_openib_component.c:963
#5  0x00007f4c349ae696 in opal_obj_run_destructors (object=0x102d9b0) at ../../../../opal/class/opal_object.h:462
#6  0x00007f4c349b4700 in mca_btl_openib_finalize_resources (btl=0x103bec0) at btl_openib.c:1714
#7  0x00007f4c349b4806 in mca_btl_openib_finalize (btl=0x103bec0) at btl_openib.c:1741
#8  0x00007f4c4100d258 in mca_btl_base_close () at base/btl_base_frame.c:202
#9  0x00007f4c40ff0068 in mca_base_framework_close (framework=0x7f4c412bde40 <opal_btl_base_framework>) at mca_base_framework.c:216
#10 0x00007f4c41a96853 in mca_bml_base_close () at base/bml_base_frame.c:130
#11 0x00007f4c40ff0068 in mca_base_framework_close (framework=0x7f4c41d4ba80 <ompi_bml_base_framework>) at mca_base_framework.c:216
#12 0x00007f4c41a11713 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:453
#13 0x00007f4c41a4f383 in PMPI_Finalize () at pfinalize.c:47
#14 0x0000000000400b23 in main (argc=1, argv=0x7ffdab49ee88) at test_madvise.cpp:44

Happy Holidays.

@hjelmn
Member

hjelmn commented Dec 22, 2017

Found the issue. Fixing the erroneous error message invalidated some internal assumptions. Fix incoming.

hjelmn added a commit to hjelmn/ompi that referenced this issue Dec 22, 2017
This commit fixes an issue when a registration is created for a large
region and then invalidated while part of it is in use.

References open-mpi#4509

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
hjelmn added a commit that referenced this issue Dec 22, 2017
hjelmn added a commit to hjelmn/ompi that referenced this issue Dec 22, 2017
hjelmn added a commit to hjelmn/ompi that referenced this issue Dec 22, 2017
hjelmn added a commit to hjelmn/ompi that referenced this issue Dec 22, 2017
@bwbarrett
Member

@hjelmn, can this be closed now?

@hjelmn
Member

hjelmn commented Jan 16, 2018

Yup.

@hjelmn hjelmn closed this as completed Jan 16, 2018