Closed
Description
Hi,
One of our users reported a segfault that appears under specific circumstances.
When using Slurm (14.03.8) and `srun` to launch an OpenSees job (http://opensees.berkeley.edu/index.php, compiled and executed with Open MPI 1.8.2) with more than 2 tasks, execution fails with a segmentation fault that seems to occur in the vader BTL. I have no idea why it's being used; we're running an x86 Linux cluster on Red Hat 6.5.
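For reference, the BTL components Open MPI installed and the selection decisions made at runtime can be inspected with standard Open MPI tooling (a diagnostic sketch; exact output varies by version, and `OpenSeesMP` here stands in for whatever binary is launched):

```shell
# List the BTL components built into this Open MPI install
ompi_info | grep btl

# Log BTL selection decisions verbosely during a run
mpirun -n 2 --mca btl_base_verbose 100 OpenSeesMP
```

This shows whether vader is being picked for on-node communication, which would explain why it appears in the crash even though it was never explicitly requested.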
The reproducer is as follows:

```
$ srun -n 2 --pty bash
cn01:~$ srun OpenSeesMP
srun: error: cn01: task 0: Segmentation fault
```
The stack is:

```
cn01:~$ srun gdb OpenSeesMP
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-60.el6_4.1)
[...]
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff1c2e8ea in mca_btl_vader_sendi () from /share/sw/free/openmpi/1.8.2/gcc/4.4/lib/openmpi/mca_btl_vader.so
[...]
(gdb) bt
#0  0x00007ffff1c2e8ea in mca_btl_vader_sendi () from /share/sw/free/openmpi/1.8.2/gcc/4.4/lib/openmpi/mca_btl_vader.so
#1  0x00007ffff15f604f in mca_pml_ob1_send_inline () from /share/sw/free/openmpi/1.8.2/gcc/4.4/lib/openmpi/mca_pml_ob1.so
#2  0x00007ffff15f6fe1 in mca_pml_ob1_send () from /share/sw/free/openmpi/1.8.2/gcc/4.4/lib/openmpi/mca_pml_ob1.so
#3  0x00007ffff7b7868a in PMPI_Send () from /share/sw/free/openmpi/1.8.2/gcc/4.4/lib/libmpi.so.1
#4  0x00000000005a1315 in MPI_Channel::sendID(int, int, ID const&, ChannelAddress*) ()
#5  0x00000000005787c6 in main ()
```
- If I run only 1 task (`srun -n1 OpenSeesMP`), no segfault occurs.
- If I use `mpirun` instead of `srun`, no segfault.
- If I move `mca_btl_vader.so` aside, no segfault either.
So I have 2 questions:
- Why is vader even used?
- What causes the segfault?
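In the meantime, a less invasive workaround than moving the shared library aside may be to exclude vader at runtime via an MCA parameter (a sketch, assuming the standard Open MPI `btl` selection syntax; the `-n 4` task count is illustrative):

```shell
# Exclude the vader BTL for srun-launched jobs via the environment
export OMPI_MCA_btl="^vader"
srun -n 4 OpenSeesMP

# Or equivalently on the mpirun command line
mpirun -n 4 --mca btl ^vader OpenSeesMP
```

With vader excluded, Open MPI should fall back to another available transport (e.g. the sm or tcp BTL) for on-node messages.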
Thanks!