Segfault with Slurm (14.03.8), 2+ tasks, and OpenSees #235

@kcgthb

Hi,

One of our users reported a segfault that appears under specific circumstances.

When using Slurm (14.03.8) and srun to launch an OpenSees job (http://opensees.berkeley.edu/index.php, compiled and executed with Open MPI 1.8.2) with 2 or more tasks, the execution fails with a segmentation fault that seems to occur in the vader BTL. I have no idea why vader is being used; we're running an x86 Linux cluster on Red Hat 6.5.

The reproducer is as follows:

$ srun -n 2 --pty bash
cn01:~$ srun OpenSeesMP
srun: error: cn01: task 0: Segmentation fault

The stack is:

cn01:~$ srun gdb OpenSeesMP
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-60.el6_4.1)
[...]
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff1c2e8ea in mca_btl_vader_sendi () from /share/sw/free/openmpi/1.8.2/gcc/4.4/lib/openmpi/mca_btl_vader.so
[...]
(gdb) bt
#0  0x00007ffff1c2e8ea in mca_btl_vader_sendi () from /share/sw/free/openmpi/1.8.2/gcc/4.4/lib/openmpi/mca_btl_vader.so
#1  0x00007ffff15f604f in mca_pml_ob1_send_inline () from /share/sw/free/openmpi/1.8.2/gcc/4.4/lib/openmpi/mca_pml_ob1.so
#2  0x00007ffff15f6fe1 in mca_pml_ob1_send () from /share/sw/free/openmpi/1.8.2/gcc/4.4/lib/openmpi/mca_pml_ob1.so
#3  0x00007ffff7b7868a in PMPI_Send () from /share/sw/free/openmpi/1.8.2/gcc/4.4/lib/libmpi.so.1
#4  0x00000000005a1315 in MPI_Channel::sendID(int, int, ID const&, ChannelAddress*) ()
#5  0x00000000005787c6 in main ()

If I run only 1 task (srun -n1 OpenSeesMP), no segfault occurs.
If I use mpirun instead of srun, no segfault.
If I move aside mca_btl_vader.so, no segfault either.
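
A less invasive way to run the same test, instead of moving the shared object aside, would presumably be to exclude the component through an MCA parameter. This is only a sketch: I'm assuming the usual `btl` selection syntax (a leading `^` excludes the listed components) and that srun passes the environment through to the tasks. The `command -v` guards are just there so the snippet does nothing on a machine without Slurm or the binary.

```shell
# Assumption: standard MCA selection syntax; '^' excludes the named BTL.
# OMPI_MCA_<name> environment variables are read by Open MPI at startup.
export OMPI_MCA_btl='^vader'

# Launch as in the reproducer, but only where srun and OpenSeesMP exist.
if command -v srun >/dev/null 2>&1 && command -v OpenSeesMP >/dev/null 2>&1; then
    srun OpenSeesMP
fi
```

With mpirun, the corresponding form would be `mpirun --mca btl ^vader OpenSeesMP`, although mpirun does not segfault here in the first place.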

So I have 2 questions:

  1. why is vader even used?
  2. what causes the segfault?

Thanks!
