Fixing dlopen() the MPI shared library with RTLD_LOCAL  #3705

Closed
@dalcinl

The lack of support for dlopen()ing the MPI shared library within a local namespace (RTLD_LOCAL) has been a recurrent issue for implementors of MPI bindings in dynamic languages like R, Julia, and Python. Even the Java JNI bindings you guys distribute have to resort to the hack of re-dlopening the library with RTLD_GLOBAL before invoking MPI_Init(). I complained about this ages ago, it was never fixed, and eventually I rolled my own re-dlopen hack for mpi4py. However, I really HATE it because it is quite fragile. For example, macOS changed the rules at some point. Also, in Linux distros, if the user does not install the openmpi-devel package, the symlink libmpi.so -> libmpi.so.<version> is not available, so the code implementing the hackery has to be updated every time the library version is bumped. BTW, your Java JNI bindings are broken because of this: you should not dlopen("libmpi.so",...), you have to dlopen("libmpi.so.20",...) on Linux to not depend on the openmpi-devel package at runtime.
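For readers who have not seen it, the re-dlopen workaround described above looks roughly like the following. This is a minimal sketch, not the actual mpi4py or Java JNI code: the helper names `soname_candidates` and `load_global` are made up for illustration, and the list of versions to try is exactly the part that has to be maintained by hand.

```python
import ctypes


def soname_candidates(stem, versions):
    """Build library names to try: versioned sonames first, then the
    unversioned name (which only exists when the -devel package is installed)."""
    names = ["lib{}.so.{}".format(stem, v) for v in versions]
    names.append("lib{}.so".format(stem))
    return names


def load_global(names):
    """Load the first library that can be found, exporting its symbols
    to the global namespace so later-loaded plugins can resolve them."""
    errors = []
    for name in names:
        try:
            return ctypes.CDLL(name, mode=ctypes.RTLD_GLOBAL)
        except OSError as exc:
            errors.append("{}: {}".format(name, exc))
    raise OSError("could not load any of: " + "; ".join(errors))
```

A caller would run something like `load_global(soname_candidates("mpi", [20]))` before MPI_Init(). The fragility is visible in the `[20]`: every soname bump needs a code change.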

As a reproducer, you have this piece of Python code.

import ctypes
libdir = "/home/devel/mpi/openmpi/2.1.1/lib/"

# Load libmpi into a local namespace: its symbols are NOT exported globally.
lib = ctypes.CDLL(libdir + "libmpi.so", ctypes.RTLD_LOCAL)
ierr = lib.MPI_Init(None, None)
assert ierr == 0

# Query the library version string into a caller-provided buffer.
r = ctypes.create_string_buffer(4096)
n = ctypes.c_int()
ierr = lib.MPI_Get_library_version(r, ctypes.byref(n))
assert ierr == 0
print(r[0:n.value])

ierr = lib.MPI_Finalize()
assert ierr == 0

Running it of course fails:

$ python ompi-dlopen.py 
[kw14821:30046] mca_base_component_repository_open: unable to open mca_patcher_overwrite: /home/devel/mpi/openmpi/2.1.1/lib/openmpi/mca_patcher_overwrite.so: undefined symbol: mca_patcher_base_patch_t_class (ignored)
... lots of output ...
[kw14821:30046] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

The root problem is that none of the *.so plugin files in $prefix/lib/openmpi/ explicitly link against the MPI library. The following script hot-fixes the issue using patchelf (version 0.9 required):

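#!/bin/sh
prefix="/home/devel/mpi/openmpi/2.1.1"
# Patch every plugin so that it depends on libmpi and can locate it.
for filename in "$prefix"/lib/openmpi/*.so; do
    # Record libmpi as an explicit dependency of the plugin.
    patchelf --add-needed libmpi.so.20 "$filename"
    # Search for libmpi in the plugin's parent directory ($prefix/lib).
    patchelf --set-rpath "\$ORIGIN/.." "$filename"
done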

Please note the special rpath I'm adding, $ORIGIN/.. ; on macOS it should be @loader_path/.. and the script should be based on install_name_tool.

After running the shell script above and hot-fixing the plugins, the Python script runs just fine:

$ ./ompi-fix-libs.sh 
$ python ompi-dlopen.py 
Open MPI v2.1.1, package: Open MPI dalcinl@kw14821 Distribution, ident: 2.1.1, repo rev: v2.1.0-100-ga2fdb5b, May 10, 2017
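To see which plugins lack the libmpi dependency, before or after running the fix script, one can list each plugin's needed libraries. A minimal sketch, assuming patchelf is on PATH; the helper names `needed_libs` and `plugins_missing` are made up for illustration, and the `needed` parameter exists only so the lookup can be swapped out:

```python
import glob
import os
import subprocess


def needed_libs(path):
    """Return the libraries an ELF file declares as needed, via patchelf."""
    result = subprocess.run(
        ["patchelf", "--print-needed", path],
        check=True, stdout=subprocess.PIPE, universal_newlines=True,
    )
    return result.stdout.split()


def plugins_missing(prefix, soname="libmpi.so.20", needed=needed_libs):
    """List plugins under prefix/lib/openmpi that do not declare soname."""
    pattern = os.path.join(prefix, "lib", "openmpi", "*.so")
    return [p for p in sorted(glob.glob(pattern)) if soname not in needed(p)]
```

With the prefix used above, `plugins_missing("/home/devel/mpi/openmpi/2.1.1")` should return every plugin before the hot-fix and an empty list afterwards.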

In short, I think the way to fix the issue once and for all (at least for Linux, macOS, and Solaris) is to link the plugins in $prefix/lib/openmpi with -L$BUILDDIR/lib -lmpi -Wl,-rpath,\$ORIGIN/.. on Linux/Solaris and with -L$BUILDDIR/lib -lmpi -Wl,-rpath,@loader_path/.. on macOS.
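The per-platform choice of flags can be written down compactly. The sketch below is only an illustration of the rule, not a proposal for how the build system should implement it; `plugin_link_flags` is a made-up helper, and the platform strings follow Python's `sys.platform` conventions.

```python
def plugin_link_flags(platform, builddir):
    """Return linker flags that make a plugin depend on libmpi and find it
    relative to its own location at runtime."""
    if platform.startswith("darwin"):
        origin = "@loader_path"   # macOS spelling of "directory of this library"
    elif platform.startswith(("linux", "sunos")):
        origin = "$ORIGIN"        # ELF spelling; escape the $ when going through a shell
    else:
        raise ValueError("no relative rpath rule known for " + platform)
    return [
        "-L{}/lib".format(builddir),
        "-lmpi",
        "-Wl,-rpath,{}/..".format(origin),
    ]
```

For example, `plugin_link_flags(sys.platform, os.environ["BUILDDIR"])` would give the flags for the current machine, assuming BUILDDIR is set as in the text.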

Unfortunately, I cannot offer a patch: I'm not an expert in autotools and I have no idea how to implement these changes. However, I can offer my help to review changes and test them on Linux and macOS.
