Description
The lack of support for dlopen()ing the MPI shared library within a local namespace has been a recurrent issue for implementors of MPI bindings in dynamic languages like R, Julia, and Python. Even the Java JNI bindings you guys distribute have to resort to the hack of re-dlopening the library with RTLD_GLOBAL before the invocation of MPI_Init(). I complained about this ages ago, it was never fixed, and eventually I rolled my own re-dlopen hack for mpi4py. However, I really HATE it because it is quite fragile. For example, macOS changed the rules at some point. Also, in Linux distros, if the user does not install the openmpi-devel package, the symlink libmpi.so -> libmpi.so.<version> is not available, so the code implementing the hackery has to be updated every time the library version is bumped. BTW, your Java JNI bindings are broken because of this: you should not dlopen("libmpi.so",...), you have to dlopen("libmpi.so.20",...) on Linux so as not to depend on the openmpi-devel package at runtime.
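For reference, the re-dlopen workaround described above boils down to something like the following. This is only a minimal sketch, not the actual mpi4py code: the function name load_mpi_global and the idea of passing a list of candidate names are assumptions made for illustration, and the candidate list has to be maintained by hand, which is exactly the fragile part.

```python
import ctypes

def load_mpi_global(candidates, loader=ctypes.CDLL):
    """Load the first library in `candidates` that opens, using RTLD_GLOBAL.

    `loader` is injectable only so the function can be exercised without
    an MPI installation; by default it is ctypes.CDLL.
    """
    errors = []
    for name in candidates:
        try:
            # RTLD_GLOBAL makes the library's symbols visible to the
            # plugins that are dlopen()ed later during MPI_Init().
            return loader(name, mode=ctypes.RTLD_GLOBAL)
        except OSError as exc:
            errors.append("%s: %s" % (name, exc))
    raise OSError("could not load any of: " + "; ".join(errors))

# Hypothetical usage, before MPI_Init() is called:
#   lib = load_mpi_global(["libmpi.so.20", "libmpi.so"])
```

The versioned name goes first so that the unversioned symlink from openmpi-devel is only a fallback.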
As a reproducer, here is a short Python script:
import ctypes
libdir = "/home/devel/mpi/openmpi/2.1.1/lib/"
lib = ctypes.CDLL(libdir+"libmpi.so", ctypes.RTLD_LOCAL)
ierr = lib.MPI_Init(None,None)
assert ierr==0
r = ctypes.create_string_buffer(4096)
n = ctypes.c_int()
ierr = lib.MPI_Get_library_version(r, ctypes.byref(n))
assert ierr==0
print(r[0:n.value])
ierr = lib.MPI_Finalize()
assert ierr==0
Running it of course fails:
$ python ompi-dlopen.py
[kw14821:30046] mca_base_component_repository_open: unable to open mca_patcher_overwrite: /home/devel/mpi/openmpi/2.1.1/lib/openmpi/mca_patcher_overwrite.so: undefined symbol: mca_patcher_base_patch_t_class (ignored)
... lots of output ...
[kw14821:30046] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
The root problem is that the *.so files in $prefix/lib/openmpi/ do not explicitly link to the MPI library. The following script hot-fixes the issue using patchelf (version 0.9 required):
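#!/bin/sh
prefix="/home/devel/mpi/openmpi/2.1.1"
# Patch every plugin so it depends on libmpi and looks for it in the parent directory.
for filename in "$prefix"/lib/openmpi/*.so; do
    patchelf --add-needed libmpi.so.20 "$filename"
    # Keep $ORIGIN literal; the dynamic loader expands it at load time.
    patchelf --set-rpath '$ORIGIN/..' "$filename"
done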
Please note the special rpath I'm adding, $ORIGIN/.. ; for macOS it should be @loader_path/.. and the script should be based on install_name_tool.
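A quick way to confirm the diagnosis on Linux is to look at the NEEDED entries of a plugin before and after patching. The sketch below parses the output of readelf -d; the helper names needed_libs and links_to_mpi are made up for illustration, and it assumes readelf (binutils) is on the PATH.

```python
import re
import subprocess

# Matches dynamic-section lines such as:
#  0x0000000000000001 (NEEDED)   Shared library: [libc.so.6]
_NEEDED = re.compile(r"\(NEEDED\)\s+Shared library: \[(.+?)\]")

def needed_libs(readelf_output):
    """Extract the DT_NEEDED library names from `readelf -d` output."""
    return _NEEDED.findall(readelf_output)

def links_to_mpi(path):
    """Return True if the shared object at `path` lists a libmpi dependency."""
    out = subprocess.check_output(["readelf", "-d", path],
                                  universal_newlines=True)
    return any(name.startswith("libmpi.so") for name in needed_libs(out))

# Hypothetical usage:
#   links_to_mpi("/home/devel/mpi/openmpi/2.1.1/lib/openmpi/mca_patcher_overwrite.so")
```

If the diagnosis is right, an unpatched plugin should have no libmpi entry, and one should appear after the patchelf --add-needed step.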
After running the shell script above and hot-fixing the plugins, the Python script runs just fine:
$ ./ompi-fix-libs.sh
$ python ompi-dlopen.py
Open MPI v2.1.1, package: Open MPI dalcinl@kw14821 Distribution, ident: 2.1.1, repo rev: v2.1.0-100-ga2fdb5b, May 10, 2017
In short, I think the way to fix the issue once and for all (at least for Linux, macOS, and Solaris) is to link the plugins in $prefix/lib/openmpi with -L$BUILDDIR/lib -lmpi -Wl,-rpath,\$ORIGIN/.. on Linux/Solaris and -L$BUILDDIR/lib -lmpi -Wl,-rpath,@loader_path/.. on macOS.
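To make the difference between the two flag sets concrete, here is a small Python sketch that picks the flags per platform. It is not an autotools patch, just an illustration: the function name plugin_ldflags and the platform strings (modelled on sys.platform prefixes) are assumptions.

```python
def plugin_ldflags(builddir, platform):
    """Return the extra linker flags proposed above for building a plugin."""
    if platform.startswith("darwin"):
        # macOS: dyld expands @loader_path to the directory of the plugin.
        origin = "@loader_path/.."
    else:
        # Linux/Solaris: the dynamic loader expands $ORIGIN at run time, so
        # the string must reach the linker literally (hence the \$ in shell).
        origin = "$ORIGIN/.."
    return ["-L" + builddir + "/lib", "-lmpi", "-Wl,-rpath," + origin]
```

Only the rpath token differs between platforms; the -L and -lmpi parts are the same in both cases.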
Unfortunately, I cannot offer a patch: I'm not an expert on autotools and I have no idea how to implement these changes. However, I can offer my help to review changes and test them on Linux and macOS.