
Segfault with Slurm (14.03.8), 2+ tasks, and OpenSees #235

Closed
kcgthb opened this issue Oct 14, 2014 · 38 comments

@kcgthb

kcgthb commented Oct 14, 2014

Hi,

One of our users reported a segfault that appears under specific circumstances.

When using Slurm (14.03.8) and srun to launch an OpenSees job (http://opensees.berkeley.edu/index.php, compiled and executed with OpenMPI 1.8.2) with 2 or more tasks, the execution fails with a segmentation fault that seems to occur in the vader BTL. I have no idea why it's being used; we're running an x86 Linux cluster on Red Hat 6.5.

The reproducer is as follows:

$ srun -n 2 --pty bash
cn01:~$ srun OpenSeesMP
srun: error: cn01: task 0: Segmentation fault

The stack is:

cn01:~$ srun gdb OpenSeesMP
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-60.el6_4.1)
[...]
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff1c2e8ea in mca_btl_vader_sendi () from /share/sw/free/openmpi/1.8.2/gcc/4.4/lib/openmpi/mca_btl_vader.so
[...]
(gdb) bt
#0  0x00007ffff1c2e8ea in mca_btl_vader_sendi () from /share/sw/free/openmpi/1.8.2/gcc/4.4/lib/openmpi/mca_btl_vader.so
#1  0x00007ffff15f604f in mca_pml_ob1_send_inline () from /share/sw/free/openmpi/1.8.2/gcc/4.4/lib/openmpi/mca_pml_ob1.so
#2  0x00007ffff15f6fe1 in mca_pml_ob1_send () from /share/sw/free/openmpi/1.8.2/gcc/4.4/lib/openmpi/mca_pml_ob1.so
#3  0x00007ffff7b7868a in PMPI_Send () from /share/sw/free/openmpi/1.8.2/gcc/4.4/lib/libmpi.so.1
#4  0x00000000005a1315 in MPI_Channel::sendID(int, int, ID const&, ChannelAddress*) ()
#5  0x00000000005787c6 in main ()

If I run only 1 task (srun -n1 OpenSeesMP), no segfault occurs.
If I use mpirun instead of srun, no segfault.
If I move aside mca_btl_vader.so, no segfault either.

So I have 2 questions:

  1. why is vader even used?
  2. what causes the segfault?

Thanks!

@rhc54
Contributor

rhc54 commented Oct 15, 2014

Vader is an alternative, potentially faster, shared memory BTL. I'm not sure why it would be segfaulting on your machine, but it is probably missing some startup info when direct launched by srun instead of mpirun. For those cases, you can just put OMPI_MCA_btl=^vader in your environment to turn it off.
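
For example, with the reproducer above, something along these lines (assuming a bash-like shell) should disable vader and fall back to the older sm shared memory BTL:

$ export OMPI_MCA_btl=^vader
$ srun OpenSeesMP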

We'll take a look and see if we can ID the problem. Have you tried 1.8.3 to see if the problem exists there too?


@kcgthb
Author

kcgthb commented Oct 15, 2014

@rhc54: thanks for the explanation. Setting OMPI_MCA_btl=^vader indeed prevents the segfault from happening.

I've installed 1.8.3 and I'll work with the user to see if we can reproduce the problem with that version. I'll keep you posted.

@jsquyres
Member

@hjelmn Are you aware of any direct-launch problems with Vader?

@kcgthb
Author

kcgthb commented Oct 15, 2014

Quick update: the segfault happens under the same circumstances with Open MPI 1.8.3, and produces the same exact backtrace.

@rhc54
Contributor

rhc54 commented Oct 15, 2014

Are you using PMI-1 or PMI-2?


@rhc54
Contributor

rhc54 commented Oct 15, 2014

Using the head of the 1.8 series (to be released soon as 1.8.4) and Slurm 2.5.4 (which means PMI-1), it works just fine.


@kcgthb
Author

kcgthb commented Oct 15, 2014

@rhc54: I'm actually not sure. Slurm 14.03.8 provides both PMI-1 and PMI-2, and I compiled Open MPI with --with-pmi, which seems to have picked up both. libmca_common_pmi.so ends up being linked against both libpmi.so.0 and libpmi2.so.0. From the Open MPI FAQ, it seems that if both are available, PMI-2 would be used; is that correct?

@kcgthb
Author

kcgthb commented Oct 15, 2014

If I run srun --mpi=none OpenSeesMP, I get the following message, but no segfault.

--------------------------------------------------------------------------
PMI2 failed to initialize, returning an error code of 14.
We cannot use PMI2 at this time, and your job will
likely abort.
--------------------------------------------------------------------------

So it does indeed look like it occurs in conjunction with PMI-2.

@rhc54
Contributor

rhc54 commented Oct 15, 2014

Nope - Slurm defaults to using PMI-1 unless you explicitly tell it to use pmi2 on the srun cmd line. We just link so we have support for either one you choose to use.

FWIW: I can direct launch an MPI "hello" and an MPI ring example just fine using PMI-1 or PMI-2 under what was released as Slurm 14 without problem, including letting vader run.


@kcgthb
Author

kcgthb commented Oct 15, 2014

@rhc54: Oh ok. I was under the impression that PMI-2 would be used over PMI-1 by reading the FAQ, especially this: "When the --with-pmi option is given, OMPI will automatically determine if PMI-2 support was built and use it in place of PMI-1" in https://www.open-mpi.org/faq/?category=slurm#slurm-direct-srun-mpi-apps

I was also able to run a simple MPI hello world program without a segfault, so it seems to be somewhat related to the OpenSees application, except the stack trace really seems to point at vader, which confuses me.

@rhc54
Contributor

rhc54 commented Oct 15, 2014

On Oct 15, 2014, at 1:19 PM, Kilian Cavalotti notifications@github.com wrote:

@rhc54: Oh ok. I was under the impression that PMI-2 would be used over PMI-1 by reading the FAQ, especially this: "When the --with-pmi option is given, OMPI will automatically determine if PMI-2 support was built and use it in place of PMI-1" in https://www.open-mpi.org/faq/?category=slurm#slurm-direct-srun-mpi-apps

We can only decide that for build purposes - Slurm needs to know which one to actually use when it runs. So if you want Slurm to use PMI-2, you have to tell it to do so. Either on the srun cmd line, or you can set it as the default in the slurm.conf file.
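
Concretely, that means either adding the option on the srun command line, e.g. something like

srun --mpi=pmi2 OpenSeesMP

or setting MpiDefault=pmi2 in slurm.conf (standard Slurm option and parameter names, shown here just as an illustration).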

I was also able to run a simple MPI hello world program without a segfault, so it seems to be somewhat related to the OpenSees application, except the stack trace really seems to point at vader, which confuses me.

Could be memory corruption? You could try running it thru valgrind. Add --enable-memchecker --enable-valgrind to the OMPI configure line so OMPI will be (relatively) clean.



@hjelmn
Member

hjelmn commented Oct 16, 2014

@jsquyres vader runs cleanly for me when launched directly with either srun or aprun.

Can you provide more details? A line number at a minimum would help finger the culprit. It is very, very unlikely to be a vader issue; it is more likely an application issue. The sendi function does nothing more than copy from the user's pointer to a shared memory region. The region is allocated in exactly the same way it is in btl/sm (except each process owns its own segment).

@ggouaillardet
Contributor

i am able to reproduce the issue.
under srun, when task 0 calls vader_add_procs, it ends up calling segment_attach from shmem_mmap_module.c,
but ds_buf->seg_name is "/tmp/openmpi-sessi" (a truncated string).
then
ds_buf->seg_id = open(ds_buf->seg_name, O_CREAT | O_RDWR, 0600)
creates an empty file.
(i did not analyze this fully, but the comments say that segment_attach can only be called after a
successful call to segment_create, so do we really need the O_CREAT flag ?)
and then
mmap(..., ds_buf->seg_size, ..., ds_buf->seg_id, ...);
succeeds even though seg_size is 4M and the file is empty,
which means we cannot read/write to ds_buf->seg_base_addr, and that will lead to a crash later.

at this stage of the investigation :

  • the truncated string is the cause of the crash (but not the root cause, which i am now looking for;
    the string is not truncated when mpirun is used)
  • is the O_CREAT flag required, or is it a bug ?
  • can someone explain why mmap did not fail ? that looks very counter-intuitive to me ... (see the
    standalone sketch below)
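
to illustrate that last point, here is a minimal standalone sketch (plain POSIX, not Open MPI code; the path and size are made up for the demo). mmap() is allowed to map a region larger than the backing file, so the call itself succeeds even on an empty file; the failure only appears when the mapping is first touched, as a fatal signal at access time rather than an error code from mmap():

/* standalone demo: mmap() of an empty file succeeds, the failure is
 * deferred to the first access of the mapping */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t seg_size = 4 * 1024 * 1024;    /* 4M, like the vader segment */

    /* hypothetical path standing in for the (truncated) seg_name */
    int fd = open("/tmp/empty_segment_demo", O_CREAT | O_RDWR, 0600);
    if (-1 == fd) { perror("open"); return 1; }
    unlink("/tmp/empty_segment_demo");          /* clean up the demo file */

    /* no ftruncate(fd, seg_size): the file stays empty, like the file
     * created by accident in segment_attach */
    void *base = mmap(NULL, seg_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (MAP_FAILED == base) { perror("mmap"); return 1; }
    printf("mmap of an empty file succeeded at %p\n", base);

    /* first access: the kernel cannot back the page with file data,
     * so the process dies here instead of getting an error back */
    memset(base, 0, seg_size);

    printf("never reached\n");
    return 0;
}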

@ggouaillardet
Contributor

PR #238 makes the master work again, i am now investigating the latest v1.8

@ggouaillardet
Contributor

i tried several things but i am unable to reproduce the issue with v1.8

@rhc54 could you please clarify for me pmi1 vs pmi2 ?
my understanding is that:

  • in v1.8, pmi1 vs pmi2 is decided at configure time
  • in trunk, the final decision will be made at runtime

is that correct ?

@kcgthb your hello world program must include at least one send/recv to ensure you get a chance to hit the issue
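
for reference, a minimal test along those lines could look like this (a hypothetical send_recv.c, basically the classic MPI tutorial send/recv example); it needs at least 2 tasks:

/* minimal send/recv test: enough point-to-point traffic to exercise the
 * shared memory fast path */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, number;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (0 == rank) {
        number = -1;
        MPI_Send(&number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* rank 0 -> rank 1 */
    } else if (1 == rank) {
        MPI_Recv(&number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 1 received number %d from process 0\n", number);
    }

    MPI_Finalize();
    return 0;
}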

@rhc54
Contributor

rhc54 commented Oct 16, 2014

On Oct 16, 2014, at 12:25 AM, Gilles Gouaillardet notifications@github.com wrote:

i tried several things but i am unable to reproduce the issue with v1.8

@rhc54 could you please clarify for me pmi1 vs pmi2 ?
my understanding is that :

in v1.8, pmi1 vs pmi2 is decided at configure time

Yes - if you configure for pmi2, then we only support pmi2 operations

in trunk, the final decision will be done at runtime is that correct ?
Yes and no - the capability for that is present, but it doesn't always work. For example, Slurm has a bug in its pmi libraries such that PMI_Init always returns true even if pmi1 isn't available. We haven't found a reliable way of detecting when pmi1 vs pmi2 is being used.
@kcgthb your hello world program must include at least one send/recv to ensure you get a chance to hit the issue

Ah - that would explain why my tests were passing. Thanks!



@hjelmn
Member

hjelmn commented Oct 16, 2014

@ggouaillardet Thanks for tracking that down. I don't know why opal/shmem uses O_CREAT there. I will ask the original author of that code to see what's going on. I assumed segment_attach would fail if the file didn't exist, disqualifying vader. btl/sm was only working because it doesn't send the filename through the modex.

@rhc54
Contributor

rhc54 commented Oct 16, 2014

Nathan: could you please take a look at the proposed fix? I flagged you, Howard, and Elena on it for review as I'm not sure of the impact it will have on minimizing #keys pushed to pmi.


@ggouaillardet
Contributor

This will send fewer keys (from a pure pmi point of view), but all the keys will be sent at the last minute.

If the goal was to send keys at regular intervals, then slicing was fine, but the logic to re-assemble the slices must be reviewed.


@kcgthb
Author

kcgthb commented Oct 16, 2014

@ggouaillardet: Hi Gilles! Still in Rokkasho?
Thanks for the great investigation!
You're also right about the send/recv part: I was able to reproduce the segfault in OMPI 1.8.3 with this simple example: https://github.com/wesleykendall/mpitutorial/blob/master/mpi_send_recv/send_recv.c

@rhc54: regarding memory corruption, I tried this on a bunch of different machines, and the segfault behavior is consistent.

@rhc54
Contributor

rhc54 commented Oct 16, 2014


Yes, I think Gilles hit the key point and fixed the problem. We are looking at the fix to ensure it doesn't have some unintended consequences, and then we will push it into 1.8.4, which will be released soon.



@kcgthb
Author

kcgthb commented Oct 16, 2014

@rhc54: that looks awesome, thanks.

@ggouaillardet
Contributor

@rhc54 the point i hit only applies to the master; i did not see such a thing in the v1.8 branch

@kcgthb can you tell me more about your config ?

by default (configure && make install) slurm 14.03.8 provides only PMI1 support.
did you also install PMI2 support from the contribs directory ?
how did you configure slurm and openmpi ?
did you configure/make openmpi on the same system that is running slurm and how ?
could you please run
ldd /.../lib/openmpi/mca_ess_pmi.so
and run
OMPI_MCA_shmem_base_verbose=255 srun ./send_recv
and check for truncated file names ?

when the crash occurs under gdb, can you run
pmap
and make sure the vader files (/../vader_segment.*) are 4MB

thanks

Gilles

PS
yep, still in Rokkasho ... not everybody gets a chance to return to Palo Alto ;-)

@kcgthb
Author

kcgthb commented Oct 17, 2014

@ggouaillardet

did you also install PMI2 support from the contribs directory ?

I configured OMPI with the --with-pmi flag, and apparently it picked up both PMI and PMI-2, since our /usr/include/slurm contains pmi.h AND pmi2.h. libmca_common_pmi.so ends up being linked against both libpmi.so.0 and libpmi2.so.0.

how did you configure slurm and openmpi ?

  • Slurm: I used the provided .spec file, and created the RPMs with rpmbuild -ta --with blcr --with pam --with lua slurm-14.03.8.tar.bz2
  • Open MPI: ./configure --with-pmi --with-hwloc=internal --prefix=/share/sw/free/openmpi/1.8.3/

did you configure/make openmpi on the same system that is running slurm and how ?

Yes.

could you please run
ldd /.../lib/openmpi/mca_ess_pmi.so

$ ldd /share/sw/free/openmpi/1.8.3/gcc/4.4/lib/openmpi/mca_ess_pmi.so
        linux-vdso.so.1 =>  (0x00007fffcf7ff000)
        libcr_run.so => /usr/lib64/libcr_run.so (0x00007f03a8884000)
        libmca_common_pmi.so.1 => /share/sw/free/openmpi/1.8.3/gcc/4.4/lib/libmca_common_pmi.so.1 (0x00007f03a8681000)
        libpmi2.so.0 => /usr/lib64/libpmi2.so.0 (0x00007f03a8469000)
        libpmi.so.0 => /usr/lib64/libpmi.so.0 (0x00007f03a8264000)
        libslurm.so.27 => /usr/lib64/libslurm.so.27 (0x00007f03a7f30000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f03a7d2c000)
        libhwloc.so.5 => /usr/lib64/libhwloc.so.5 (0x00007f03a7afb000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f03a78dd000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f03a76d5000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f03a7451000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00007f03a724d000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f03a6eb9000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003b6aa00000)
        libxml2.so.2 => /usr/lib64/libxml2.so.2 (0x00007f03a6b66000)
        libz.so.1 => /lib64/libz.so.1 (0x00007f03a6950000)

and run
OMPI_MCA_shmem_base_verbose=255 srun ./send_recv
and check for truncated file names ?

Mmmh, I don't see any file name:

$ OMPI_MCA_shmem_base_verbose=255 srun ./send_recv
[sh-5-35.local:29589] mca: base: components_register: registering shmem components
[sh-5-35.local:29588] mca: base: components_register: registering shmem components
[sh-5-35.local:29588] mca: base: components_register: found loaded component mmap
[sh-5-35.local:29588] mca: base: components_register: component mmap register function successful
[sh-5-35.local:29588] mca: base: components_register: found loaded component posix
[sh-5-35.local:29589] mca: base: components_register: found loaded component mmap
[sh-5-35.local:29588] mca: base: components_register: component posix register function successful
[sh-5-35.local:29589] mca: base: components_register: component mmap register function successful
[sh-5-35.local:29589] mca: base: components_register: found loaded component posix
[sh-5-35.local:29589] mca: base: components_register: component posix register function successful
[sh-5-35.local:29589] mca: base: components_register: found loaded component sysv
[sh-5-35.local:29589] mca: base: components_register: component sysv register function successful
[sh-5-35.local:29589] mca: base: components_open: opening shmem components
[sh-5-35.local:29589] mca: base: components_open: found loaded component mmap
[sh-5-35.local:29589] mca: base: components_open: component mmap open function successful
[sh-5-35.local:29589] mca: base: components_open: found loaded component posix
[sh-5-35.local:29589] mca: base: components_open: component posix open function successful
[sh-5-35.local:29589] mca: base: components_open: found loaded component sysv
[sh-5-35.local:29589] mca: base: components_open: component sysv open function successful
[sh-5-35.local:29589] shmem: base: runtime_query: Auto-selecting shmem components
[sh-5-35.local:29589] shmem: base: runtime_query: (shmem) Querying component (run-time) [mmap]
[sh-5-35.local:29589] shmem: base: runtime_query: (shmem) Query of component [mmap] set priority to 50
[sh-5-35.local:29589] shmem: base: runtime_query: (shmem) Querying component (run-time) [posix]
[sh-5-35.local:29589] shmem: base: runtime_query: (shmem) Query of component [posix] set priority to 40
[sh-5-35.local:29589] shmem: base: runtime_query: (shmem) Querying component (run-time) [sysv]
[sh-5-35.local:29589] shmem: base: runtime_query: (shmem) Query of component [sysv] set priority to 30
[sh-5-35.local:29589] shmem: base: runtime_query: (shmem) Selected component [mmap]
[sh-5-35.local:29589] mca: base: close: unloading component posix
[sh-5-35.local:29588] mca: base: components_register: found loaded component sysv
[sh-5-35.local:29588] mca: base: components_register: component sysv register function successful
[sh-5-35.local:29588] mca: base: components_open: opening shmem components
[sh-5-35.local:29589] mca: base: close: unloading component sysv
[sh-5-35.local:29588] mca: base: components_open: found loaded component mmap
[sh-5-35.local:29588] mca: base: components_open: component mmap open function successful
[sh-5-35.local:29588] mca: base: components_open: found loaded component posix
[sh-5-35.local:29588] mca: base: components_open: component posix open function successful
[sh-5-35.local:29588] mca: base: components_open: found loaded component sysv
[sh-5-35.local:29588] mca: base: components_open: component sysv open function successful
[sh-5-35.local:29588] shmem: base: runtime_query: Auto-selecting shmem components
[sh-5-35.local:29588] shmem: base: runtime_query: (shmem) Querying component (run-time) [mmap]
[sh-5-35.local:29588] shmem: base: runtime_query: (shmem) Query of component [mmap] set priority to 50
[sh-5-35.local:29588] shmem: base: runtime_query: (shmem) Querying component (run-time) [posix]
[sh-5-35.local:29588] shmem: base: runtime_query: (shmem) Query of component [posix] set priority to 40
[sh-5-35.local:29588] shmem: base: runtime_query: (shmem) Querying component (run-time) [sysv]
[sh-5-35.local:29588] shmem: base: runtime_query: (shmem) Query of component [sysv] set priority to 30
[sh-5-35.local:29588] shmem: base: runtime_query: (shmem) Selected component [mmap]
[sh-5-35.local:29588] mca: base: close: unloading component posix
[sh-5-35.local:29588] mca: base: close: unloading component sysv
srun: error: sh-5-35: task 0: Segmentation fault

when the crash occurs under gdb, can you run
pmap
and make sure the vader files (/../vader_segment.*) are 4MB

They are 4.1MB:

# ps aux | grep send
kilian   30093  0.0  0.0 323444  4804 pts/0    Sl+  09:10   0:00 srun gdb ./send_recv
kilian   30094  0.0  0.0  43380   784 pts/0    S+   09:10   0:00 srun gdb ./send_recv
kilian   30108  0.4  0.0 215536 31900 ?        S    09:10   0:00 /usr/bin/gdb ./send_recv
kilian   30109  0.4  0.0 212068 30864 ?        S    09:10   0:00 /usr/bin/gdb ./send_recv
kilian   30117  0.1  0.0 323068 11080 ?        TLl  09:10   0:00 /home/kilian/tests/mpi_send_recv/send_recv
kilian   30118 98.5  0.0 323024 11076 ?        RLl  09:10   1:01 /home/kilian/tests/mpi_send_recv/send_recv
root     30433  0.0  0.0 105312   884 pts/1    S+   09:11   0:00 grep send
# pmap  30117 | grep vader_seg
00007fffe6d73000   4100K rw-s-  /tmp/openmpi-sessions-kilian@sh-5-35_0/16356/3/0/vader_segment.sh-5-35.0
# pmap  30118 | grep vader_seg
00007fffeed74000   4100K rw-s-  /tmp/openmpi-sessions-kilian@sh-5-35_0/16356/3/1/vader_segment.sh-5-35.1

PS: Yeah, I can't complain. :)

@ggouaillardet
Contributor

@kcgthb your config looks good
The .spec file installs pmi2 support, and since you are running v1.8 you are using pmi2

Could you please run
srun -N 1 -n 2 strace -f -e getpid ./send_recv

I also found that some errors are not correctly reported by vader and i will fix that on monday

@ggouaillardet
Contributor

@kcgthb you do not see any filenames because ompi was not configure'd with --enable-debug
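
e.g. rebuilding with something like

./configure --with-pmi --with-hwloc=internal --enable-debug --prefix=/share/sw/free/openmpi/1.8.3/

(your existing configure line plus --enable-debug) should compile those messages in; the patch posted below instead replaces the OPAL_OUTPUT_VERBOSE macros, which are compiled out in non-debug builds, with plain opal_output_verbose calls so the output shows up either way.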

@kcgthb
Author

kcgthb commented Oct 18, 2014

Could you please run
srun -N 1 -n 2 strace -f -e getpid ./send_recv

Doesn't seem to be any getpid() syscall in the execution:

$ srun -N 1 -n 2 strace -f -e getpid ./send_recv
Process 24058 attached
Process 24059 attached
Process 24060 attached
Process 24061 attached
Process 24064 attached
Process 24065 attached
[pid 24052] --- SIGSEGV (Segmentation fault) @ 0 (0) ---
[pid 24052] --- SIGSEGV (Segmentation fault) @ 0 (0) ---
[pid 24061] +++ killed by SIGSEGV +++
[pid 24059] +++ killed by SIGSEGV +++
[pid 24065] +++ killed by SIGSEGV +++

@ggouaillardet
Contributor

@kcgthb i cannot reproduce the issue with a similar environment

attached is a patch for openmpi 1.8.3; could you please apply it and run again:

OMPI_MCA_shmem_base_verbose=255 srun ./send_recv
diff -ruN orig/openmpi-1.8.3/ompi/mca/btl/vader/btl_vader_module.c openmpi-1.8.3/ompi/mca/btl/vader/btl_vader_module.c
--- orig/openmpi-1.8.3/ompi/mca/btl/vader/btl_vader_module.c 2014-07-12 03:12:21.000000000 +0900
+++ openmpi-1.8.3/ompi/mca/btl/vader/btl_vader_module.c 2014-10-20 13:57:20.810995612 +0900
@@ -193,7 +193,7 @@
 
     ep->segment_base = opal_shmem_segment_attach (&ep->seg_ds);
     if (NULL == ep->segment_base) {
-        return rc;
+        return OPAL_ERROR;
     }
 
 #endif

diff -ruN orig/openmpi-1.8.3/opal/mca/shmem/mmap/shmem_mmap_module.c openmpi-1.8.3/opal/mca/shmem/mmap/shmem_mmap_module.c
--- orig/openmpi-1.8.3/opal/mca/shmem/mmap/shmem_mmap_module.c 2014-07-12 03:12:00.000000000 +0900
+++ openmpi-1.8.3/opal/mca/shmem/mmap/shmem_mmap_module.c 2014-10-20 14:09:09.527999189 +0900
@@ -115,13 +115,13 @@
 static inline void
 shmem_ds_reset(opal_shmem_ds_t *ds_buf)
 {
-    OPAL_OUTPUT_VERBOSE(
-        (70, opal_shmem_base_framework.framework_output,
-         "%s: %s: shmem_ds_resetting "
-         "(id: %d, size:  %lu, name: %s)\n",
-         mca_shmem_mmap_component.super.base_version.mca_type_name,
-         mca_shmem_mmap_component.super.base_version.mca_component_name,
-         ds_buf->seg_id, (unsigned long)ds_buf->seg_size, ds_buf->seg_name)
+    opal_output_verbose(
+        70, opal_shmem_base_framework.framework_output,
+        "%s: %s: shmem_ds_resetting "
+        "(id: %d, size:  %lu, name: %s)\n",
+        mca_shmem_mmap_component.super.base_version.mca_type_name,
+        mca_shmem_mmap_component.super.base_version.mca_component_name,
+        ds_buf->seg_id, (unsigned long)ds_buf->seg_size, ds_buf->seg_name
     );
 
     ds_buf->seg_cpid = 0;
@@ -155,18 +155,18 @@
 {
     memcpy(to, from, sizeof(opal_shmem_ds_t));
 
-    OPAL_OUTPUT_VERBOSE(
-        (70, opal_shmem_base_framework.framework_output,
-         "%s: %s: ds_copy complete "
-         "from: (id: %d, size: %lu, "
-         "name: %s flags: 0x%02x) "
-         "to: (id: %d, size: %lu, "
-         "name: %s flags: 0x%02x)\n",
-         mca_shmem_mmap_component.super.base_version.mca_type_name,
-         mca_shmem_mmap_component.super.base_version.mca_component_name,
-         from->seg_id, (unsigned long)from->seg_size, from->seg_name,
-         from->flags, to->seg_id, (unsigned long)to->seg_size, to->seg_name,
-         to->flags)
+    opal_output_verbose(
+        70, opal_shmem_base_framework.framework_output,
+        "%s: %s: ds_copy complete "
+        "from: (id: %d, size: %lu, "
+        "name: %s flags: 0x%02x) "
+        "to: (id: %d, size: %lu, "
+        "name: %s flags: 0x%02x)\n",
+        mca_shmem_mmap_component.super.base_version.mca_type_name,
+        mca_shmem_mmap_component.super.base_version.mca_component_name,
+        from->seg_id, (unsigned long)from->seg_size, from->seg_name,
+        from->flags, to->seg_id, (unsigned long)to->seg_size, to->seg_name,
+        to->flags
     );
 
     return OPAL_SUCCESS;
@@ -293,12 +293,12 @@
         }
     }
 
-    OPAL_OUTPUT_VERBOSE(
-        (70, opal_shmem_base_framework.framework_output,
-         "%s: %s: backing store base directory: %s\n",
-         mca_shmem_mmap_component.super.base_version.mca_type_name,
-         mca_shmem_mmap_component.super.base_version.mca_component_name,
-         real_file_name)
+    opal_output_verbose(
+        70, opal_shmem_base_framework.framework_output,
+        "%s: %s: backing store base directory: %s\n",
+        mca_shmem_mmap_component.super.base_version.mca_type_name,
+        mca_shmem_mmap_component.super.base_version.mca_component_name,
+        real_file_name
     );
 
     /* determine whether the specified filename is on a network file system.
@@ -367,13 +367,13 @@
         /* set "valid" bit because setment creation was successful */
         OPAL_SHMEM_DS_SET_VALID(ds_buf);
 
-        OPAL_OUTPUT_VERBOSE(
-            (70, opal_shmem_base_framework.framework_output,
-             "%s: %s: create successful "
-             "(id: %d, size: %lu, name: %s)\n",
-             mca_shmem_mmap_component.super.base_version.mca_type_name,
-             mca_shmem_mmap_component.super.base_version.mca_component_name,
-             ds_buf->seg_id, (unsigned long)ds_buf->seg_size, ds_buf->seg_name)
+        opal_output_verbose(
+            70, opal_shmem_base_framework.framework_output,
+            "%s: %s: create successful "
+            "(id: %d, size: %lu, name: %s)\n",
+            mca_shmem_mmap_component.super.base_version.mca_type_name,
+            mca_shmem_mmap_component.super.base_version.mca_component_name,
+            ds_buf->seg_id, (unsigned long)ds_buf->seg_size, ds_buf->seg_name
         );
     }
 
@@ -417,9 +417,19 @@
 {
     pid_t my_pid = getpid();
 
+    opal_output_verbose(
+        70, opal_shmem_base_framework.framework_output,
+        "segment_attach: my_pid=%d seg_cpid=%d",
+        my_pid, ds_buf->seg_cpid
+    );
+    opal_output_verbose(
+        70, opal_shmem_base_framework.framework_output,
+        "segment_attach: %s",
+        ds_buf->seg_name
+    );
     if (my_pid != ds_buf->seg_cpid) {
-        if (-1 == (ds_buf->seg_id = open(ds_buf->seg_name, O_CREAT | O_RDWR,
-                                         0600))) {
+        if (-1 == (ds_buf->seg_id = open(ds_buf->seg_name, O_RDWR
+                                        ))) {
             int err = errno;
             char hn[MAXHOSTNAMELEN];
             gethostname(hn, MAXHOSTNAMELEN - 1);
@@ -461,13 +471,13 @@
      * work was done in segment_create :-).
      */
-    OPAL_OUTPUT_VERBOSE(
-        (70, opal_shmem_base_framework.framework_output,
-         "%s: %s: attach successful "
-         "(id: %d, size: %lu, name: %s)\n",
-         mca_shmem_mmap_component.super.base_version.mca_type_name,
-         mca_shmem_mmap_component.super.base_version.mca_component_name,
-         ds_buf->seg_id, (unsigned long)ds_buf->seg_size, ds_buf->seg_name)
+    opal_output_verbose(
+        70, opal_shmem_base_framework.framework_output,
+        "%s: %s: attach successful "
+        "(id: %d, size: %lu, name: %s)\n",
+        mca_shmem_mmap_component.super.base_version.mca_type_name,
+        mca_shmem_mmap_component.super.base_version.mca_component_name,
+        ds_buf->seg_id, (unsigned long)ds_buf->seg_size, ds_buf->seg_name
     );
 
     /* update returned base pointer with an offset that hides our stuff */
@@ -480,13 +490,13 @@
 {
     int rc = OPAL_SUCCESS;
 
-    OPAL_OUTPUT_VERBOSE(
-        (70, opal_shmem_base_framework.framework_output,
-         "%s: %s: detaching "
-         "(id: %d, size: %lu, name: %s)\n",
-         mca_shmem_mmap_component.super.base_version.mca_type_name,
-         mca_shmem_mmap_component.super.base_version.mca_component_name,
-         ds_buf->seg_id, (unsigned long)ds_buf->seg_size, ds_buf->seg_name)
+    opal_output_verbose(
+        70, opal_shmem_base_framework.framework_output,
+        "%s: %s: detaching "
+        "(id: %d, size: %lu, name: %s)\n",
+        mca_shmem_mmap_component.super.base_version.mca_type_name,
+        mca_shmem_mmap_component.super.base_version.mca_component_name,
+        ds_buf->seg_id, (unsigned long)ds_buf->seg_size, ds_buf->seg_name
    );
 
     if (0 != munmap((void *)ds_buf->seg_base_addr, ds_buf->seg_size)) {
@@ -509,13 +519,13 @@
 static int
 segment_unlink(opal_shmem_ds_t *ds_buf)
 {
-    OPAL_OUTPUT_VERBOSE(
-        (70, opal_shmem_base_framework.framework_output,
-         "%s: %s: unlinking"
-         "(id: %d, size: %lu, name: %s)\n",
-         mca_shmem_mmap_component.super.base_version.mca_type_name,
-         mca_shmem_mmap_component.super.base_version.mca_component_name,
-         ds_buf->seg_id, (unsigned long)ds_buf->seg_size, ds_buf->seg_name)
+    opal_output_verbose(
+        70, opal_shmem_base_framework.framework_output,
+        "%s: %s: unlinking"
+        "(id: %d, size: %lu, name: %s)\n",
+        mca_shmem_mmap_component.super.base_version.mca_type_name,
+        mca_shmem_mmap_component.super.base_version.mca_component_name,
+        ds_buf->seg_id, (unsigned long)ds_buf->seg_size, ds_buf->seg_name
    );
 
     if (-1 == unlink(ds_buf->seg_name)) {

@ggouaillardet
Contributor

the patch can be downloaded at https://gist.github.com/ggouaillardet/007876c338ba26ca9d7b

@kcgthb
Author

kcgthb commented Oct 21, 2014

@ggouaillardet still got pretty much the same segfault with the patch applied (at least the backtrace looks the same):

$ OMPI_MCA_shmem_base_verbose=255 srun ./send_recv
[sh-5-35.local:42511] mca: base: components_register: registering shmem components
[sh-5-35.local:42512] mca: base: components_register: registering shmem components
[sh-5-35.local:42512] mca: base: components_register: found loaded component mmap
[sh-5-35.local:42512] mca: base: components_register: component mmap register function successful
[sh-5-35.local:42512] mca: base: components_register: found loaded component posix
[sh-5-35.local:42512] mca: base: components_register: component posix register function successful
[sh-5-35.local:42512] mca: base: components_register: found loaded component sysv
[sh-5-35.local:42512] mca: base: components_register: component sysv register function successful
[sh-5-35.local:42512] mca: base: components_open: opening shmem components
[sh-5-35.local:42512] mca: base: components_open: found loaded component mmap
[sh-5-35.local:42512] mca: base: components_open: component mmap open function successful
[sh-5-35.local:42512] mca: base: components_open: found loaded component posix
[sh-5-35.local:42512] mca: base: components_open: component posix open function successful
[sh-5-35.local:42512] mca: base: components_open: found loaded component sysv
[sh-5-35.local:42512] mca: base: components_open: component sysv open function successful
[sh-5-35.local:42512] shmem: base: runtime_query: Auto-selecting shmem components
[sh-5-35.local:42512] shmem: base: runtime_query: (shmem) Querying component (run-time) [mmap]
[sh-5-35.local:42512] shmem: base: runtime_query: (shmem) Query of component [mmap] set priority to 50
[sh-5-35.local:42512] shmem: base: runtime_query: (shmem) Querying component (run-time) [posix]
[sh-5-35.local:42512] shmem: base: runtime_query: (shmem) Query of component [posix] set priority to 40
[sh-5-35.local:42512] shmem: base: runtime_query: (shmem) Querying component (run-time) [sysv]
[sh-5-35.local:42512] shmem: base: runtime_query: (shmem) Query of component [sysv] set priority to 30
[sh-5-35.local:42512] shmem: base: runtime_query: (shmem) Selected component [mmap]
[sh-5-35.local:42511] mca: base: components_register: found loaded component mmap
[sh-5-35.local:42512] mca: base: close: unloading component posix
[sh-5-35.local:42512] mca: base: close: unloading component sysv
[sh-5-35.local:42511] mca: base: components_register: component mmap register function successful
[sh-5-35.local:42511] mca: base: components_register: found loaded component posix
[sh-5-35.local:42511] mca: base: components_register: component posix register function successful
[sh-5-35.local:42511] mca: base: components_register: found loaded component sysv
[sh-5-35.local:42511] mca: base: components_register: component sysv register function successful
[sh-5-35.local:42511] mca: base: components_open: opening shmem components
[sh-5-35.local:42511] mca: base: components_open: found loaded component mmap
[sh-5-35.local:42511] mca: base: components_open: component mmap open function successful
[sh-5-35.local:42511] mca: base: components_open: found loaded component posix
[sh-5-35.local:42511] mca: base: components_open: component posix open function successful
[sh-5-35.local:42511] mca: base: components_open: found loaded component sysv
[sh-5-35.local:42511] mca: base: components_open: component sysv open function successful
[sh-5-35.local:42511] shmem: base: runtime_query: Auto-selecting shmem components
[sh-5-35.local:42511] shmem: base: runtime_query: (shmem) Querying component (run-time) [mmap]
[sh-5-35.local:42511] shmem: base: runtime_query: (shmem) Query of component [mmap] set priority to 50
[sh-5-35.local:42511] shmem: base: runtime_query: (shmem) Querying component (run-time) [posix]
[sh-5-35.local:42511] shmem: base: runtime_query: (shmem) Query of component [posix] set priority to 40
[sh-5-35.local:42511] shmem: base: runtime_query: (shmem) Querying component (run-time) [sysv]
[sh-5-35.local:42511] shmem: base: runtime_query: (shmem) Query of component [sysv] set priority to 30
[sh-5-35.local:42511] shmem: base: runtime_query: (shmem) Selected component [mmap]
[sh-5-35.local:42511] mca: base: close: unloading component posix
[sh-5-35.local:42511] mca: base: close: unloading component sysv
[sh-5-35.local:42511] shmem: mmap: shmem_ds_resetting (id: 0, size:  0, name: )
[sh-5-35.local:42511] shmem: mmap: backing store base directory: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/shared_mem_pool.sh-5-35
[sh-5-35.local:42511] shmem: mmap: create successful (id: 23, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/shared_mem_pool.sh-5-35)
[sh-5-35.local:42511] segment_attach: my_pid=42511 seg_cpid=42511
[sh-5-35.local:42511] segment_attach: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/shared_mem_pool.sh-5-35
[sh-5-35.local:42511] shmem: mmap: attach successful (id: 23, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/shared_mem_pool.sh-5-35)
[sh-5-35.local:42511] shmem: mmap: ds_copy complete from: (id: 23, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/shared_mem_pool.sh-5-35 flags: 0x01) to: (id: 23, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/shared_mem_pool.sh-5-35 flags: 0x01)
[sh-5-35.local:42511] shmem: mmap: shmem_ds_resetting (id: 0, size:  0, name: )
[sh-5-35.local:42511] shmem: mmap: backing store base directory: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/shared_mem_btl_module.sh-5-35
[sh-5-35.local:42511] shmem: mmap: create successful (id: 23, size: 140, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/shared_mem_btl_module.sh-5-35)
[sh-5-35.local:42511] segment_attach: my_pid=42511 seg_cpid=42511
[sh-5-35.local:42511] segment_attach: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/shared_mem_btl_module.sh-5-35
[sh-5-35.local:42511] shmem: mmap: attach successful (id: 23, size: 140, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/shared_mem_btl_module.sh-5-35)
[sh-5-35.local:42511] shmem: mmap: ds_copy complete from: (id: 23, size: 140, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/shared_mem_btl_module.sh-5-35 flags: 0x01) to: (id: 23, size: 140, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/shared_mem_btl_module.sh-5-35 flags: 0x01)
[sh-5-35.local:42511] shmem: mmap: shmem_ds_resetting (id: 0, size:  0, name: )
[sh-5-35.local:42511] shmem: mmap: backing store base directory: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/0/vader_segment.sh-5-35.0
[sh-5-35.local:42511] shmem: mmap: create successful (id: 24, size: 4194312, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/0/vader_segment.sh-5-35.0)
[sh-5-35.local:42511] segment_attach: my_pid=42511 seg_cpid=42511
[sh-5-35.local:42511] segment_attach: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/0/vader_segment.sh-5-35.0
[sh-5-35.local:42511] shmem: mmap: attach successful (id: 24, size: 4194312, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/0/vader_segment.sh-5-35.0)
[sh-5-35.local:42512] shmem: mmap: shmem_ds_resetting (id: 0, size:  0, name: )
[sh-5-35.local:42512] shmem: mmap: backing store base directory: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/1/vader_segment.sh-5-35.1
[sh-5-35.local:42512] shmem: mmap: create successful (id: 24, size: 4194312, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/1/vader_segment.sh-5-35.1)
[sh-5-35.local:42512] segment_attach: my_pid=42512 seg_cpid=42512
[sh-5-35.local:42512] segment_attach: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/1/vader_segment.sh-5-35.1
[sh-5-35.local:42512] shmem: mmap: attach successful (id: 24, size: 4194312, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/1/vader_segment.sh-5-35.1)
[sh-5-35.local:42511] segment_attach: my_pid=42511 seg_cpid=42511
[sh-5-35.local:42511] segment_attach: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/shared_mem_pool.sh-5-35
[sh-5-35.local:42511] shmem: mmap: attach successful (id: 23, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/shared_mem_pool.sh-5-35)
[sh-5-35.local:42511] shmem: mmap: ds_copy complete from: (id: 23, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/shared_mem_pool.sh-5-35 flags: 0x01) to: (id: 23, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/shared_mem_pool.sh-5-35 flags: 0x01)
[sh-5-35.local:42511] shmem: mmap: unlinking(id: 23, size: 140, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/shared_mem_btl_module.sh-5-35)
[sh-5-35.local:42511] shmem: mmap: unlinking(id: 23, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/shared_mem_pool.sh-5-35)
[sh-5-35.local:42512] segment_attach: my_pid=42512 seg_cpid=42511
[sh-5-35.local:42512] segment_attach: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/shared_mem_pool.sh-5-35
[sh-5-35.local:42512] shmem: mmap: attach successful (id: 24, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/shared_mem_pool.sh-5-35)
[sh-5-35.local:42512] shmem: mmap: ds_copy complete from: (id: 24, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/shared_mem_pool.sh-5-35 flags: 0x01) to: (id: 24, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/shared_mem_pool.sh-5-35 flags: 0x01)
[sh-5-35.local:42512] segment_attach: my_pid=42512 seg_cpid=42511
[sh-5-35.local:42512] segment_attach: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/shared_mem_btl_module.sh-5-35
[sh-5-35.local:42512] shmem: mmap: attach successful (id: 29, size: 140, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/shared_mem_btl_module.sh-5-35)
[sh-5-35.local:42512] shmem: mmap: ds_copy complete from: (id: 29, size: 140, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/shared_mem_btl_module.sh-5-35 flags: 0x01) to: (id: 29, size: 140, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/37920/4/shared_mem_btl_module.sh-5-35 flags: 0x01)
srun: error: sh-5-35: task 0: Segmentation fault

Backtrace:

#0  0x00007ffff20788ea in mca_btl_vader_sendi () from /share/sw/free/openmpi/1.8.3/gcc/4.4/lib/openmpi/mca_btl_vader.so
#1  0x00007ffff1a4008f in mca_pml_ob1_send_inline () from /share/sw/free/openmpi/1.8.3/gcc/4.4/lib/openmpi/mca_pml_ob1.so
#2  0x00007ffff1a41031 in mca_pml_ob1_send () from /share/sw/free/openmpi/1.8.3/gcc/4.4/lib/openmpi/mca_pml_ob1.so
#3  0x00007ffff7b788ca in PMPI_Send () from /share/sw/free/openmpi/1.8.3/gcc/4.4/lib/libmpi.so.1
#4  0x00000000004009de in main ()

@ggouaillardet
Contributor

@kcgthb the patch did not contain any bug fix; the goal was to collect some more output.

i was able to note that tasks do not mmap the vader_segment from the other task.
that very likely explains the crash, but i still cannot figure out why we get there in the first place.

could you please confirm
OMPI_MCA_btl=vader,self srun -N 1 -n 2 ./send_recv
also produces a crash in your environment ?

i made a new patch (the previous one must be reverted before it can be applied) available at
https://gist.github.com/ggouaillardet/cb4998da2549e749dfd1

could you please apply it and run

OMPI_MCA_btl_base_verbose=255 OMPI_MCA_shmem_base_verbose=255 OMPI_MCA_btl=sm,vader,self srun -N 1 -n 2 ./send_recv

hopefully we will get a better picture of what went wrong

@kcgthb
Author

kcgthb commented Oct 22, 2014

@ggouaillardet

could you please confirm
OMPI_MCA_btl=vader,self srun -N 1 -n 2 ./send_recv
also produces a crash in your environment ?

You'll laugh: no.

$ OMPI_MCA_btl=vader,self srun -N 1 -n 2 ./send_recv
Process 1 received number -1 from process 0

And with your new patch (235.2), the results are:

$ OMPI_MCA_btl_base_verbose=255 OMPI_MCA_shmem_base_verbose=255 OMPI_MCA_btl=sm,vader,self srun -N 1 -n 2 ./send_recv
[sh-5-35.local:03853] mca: base: components_register: registering shmem components
[sh-5-35.local:03852] mca: base: components_register: registering shmem components
[sh-5-35.local:03852] mca: base: components_register: found loaded component mmap
[sh-5-35.local:03852] mca: base: components_register: component mmap register function successful
[sh-5-35.local:03852] mca: base: components_register: found loaded component posix
[sh-5-35.local:03853] mca: base: components_register: found loaded component mmap
[sh-5-35.local:03852] mca: base: components_register: component posix register function successful
[sh-5-35.local:03852] mca: base: components_register: found loaded component sysv
[sh-5-35.local:03853] mca: base: components_register: component mmap register function successful
[sh-5-35.local:03853] mca: base: components_register: found loaded component posix
[sh-5-35.local:03853] mca: base: components_register: component posix register function successful
[sh-5-35.local:03853] mca: base: components_register: found loaded component sysv
[sh-5-35.local:03853] mca: base: components_register: component sysv register function successful
[sh-5-35.local:03853] mca: base: components_open: opening shmem components
[sh-5-35.local:03853] mca: base: components_open: found loaded component mmap
[sh-5-35.local:03853] mca: base: components_open: component mmap open function successful
[sh-5-35.local:03853] mca: base: components_open: found loaded component posix
[sh-5-35.local:03853] mca: base: components_open: component posix open function successful
[sh-5-35.local:03853] mca: base: components_open: found loaded component sysv
[sh-5-35.local:03853] mca: base: components_open: component sysv open function successful
[sh-5-35.local:03853] shmem: base: runtime_query: Auto-selecting shmem components
[sh-5-35.local:03853] shmem: base: runtime_query: (shmem) Querying component (run-time) [mmap]
[sh-5-35.local:03853] shmem: base: runtime_query: (shmem) Query of component [mmap] set priority to 50
[sh-5-35.local:03853] shmem: base: runtime_query: (shmem) Querying component (run-time) [posix]
[sh-5-35.local:03853] shmem: base: runtime_query: (shmem) Query of component [posix] set priority to 40
[sh-5-35.local:03853] shmem: base: runtime_query: (shmem) Querying component (run-time) [sysv]
[sh-5-35.local:03853] shmem: base: runtime_query: (shmem) Query of component [sysv] set priority to 30
[sh-5-35.local:03853] shmem: base: runtime_query: (shmem) Selected component [mmap]
[sh-5-35.local:03853] mca: base: close: unloading component posix
[sh-5-35.local:03852] mca: base: components_register: component sysv register function successful
[sh-5-35.local:03852] mca: base: components_open: opening shmem components
[sh-5-35.local:03853] mca: base: close: unloading component sysv
[sh-5-35.local:03852] mca: base: components_open: found loaded component mmap
[sh-5-35.local:03852] mca: base: components_open: component mmap open function successful
[sh-5-35.local:03852] mca: base: components_open: found loaded component posix
[sh-5-35.local:03852] mca: base: components_open: component posix open function successful
[sh-5-35.local:03852] mca: base: components_open: found loaded component sysv
[sh-5-35.local:03852] mca: base: components_open: component sysv open function successful
[sh-5-35.local:03852] shmem: base: runtime_query: Auto-selecting shmem components
[sh-5-35.local:03852] shmem: base: runtime_query: (shmem) Querying component (run-time) [mmap]
[sh-5-35.local:03852] shmem: base: runtime_query: (shmem) Query of component [mmap] set priority to 50
[sh-5-35.local:03852] shmem: base: runtime_query: (shmem) Querying component (run-time) [posix]
[sh-5-35.local:03852] shmem: base: runtime_query: (shmem) Query of component [posix] set priority to 40
[sh-5-35.local:03852] shmem: base: runtime_query: (shmem) Querying component (run-time) [sysv]
[sh-5-35.local:03852] shmem: base: runtime_query: (shmem) Query of component [sysv] set priority to 30
[sh-5-35.local:03852] shmem: base: runtime_query: (shmem) Selected component [mmap]
[sh-5-35.local:03852] mca: base: close: unloading component posix
[sh-5-35.local:03852] mca: base: close: unloading component sysv
[sh-5-35.local:03852] mca: base: components_register: registering btl components
[sh-5-35.local:03852] mca: base: components_register: found loaded component self
[sh-5-35.local:03853] mca: base: components_register: registering btl components
[sh-5-35.local:03852] mca: base: components_register: component self register function successful
[sh-5-35.local:03852] mca: base: components_register: found loaded component sm
[sh-5-35.local:03852] mca: base: components_register: component sm register function successful
[sh-5-35.local:03852] mca: base: components_register: found loaded component vader
[sh-5-35.local:03852] mca: base: components_register: component vader register function successful
[sh-5-35.local:03852] mca: base: components_open: opening btl components
[sh-5-35.local:03852] mca: base: components_open: found loaded component self
[sh-5-35.local:03852] mca: base: components_open: component self open function successful
[sh-5-35.local:03852] mca: base: components_open: found loaded component sm
[sh-5-35.local:03852] mca: base: components_open: component sm open function successful
[sh-5-35.local:03852] mca: base: components_open: found loaded component vader
[sh-5-35.local:03852] mca: base: components_open: component vader open function successful
[sh-5-35.local:03853] mca: base: components_register: found loaded component self
[sh-5-35.local:03853] mca: base: components_register: component self register function successful
[sh-5-35.local:03853] mca: base: components_register: found loaded component sm
[sh-5-35.local:03853] mca: base: components_register: component sm register function successful
[sh-5-35.local:03853] mca: base: components_register: found loaded component vader
[sh-5-35.local:03853] mca: base: components_register: component vader register function successful
[sh-5-35.local:03853] mca: base: components_open: opening btl components
[sh-5-35.local:03853] mca: base: components_open: found loaded component self
[sh-5-35.local:03853] mca: base: components_open: component self open function successful
[sh-5-35.local:03853] mca: base: components_open: found loaded component sm
[sh-5-35.local:03853] mca: base: components_open: component sm open function successful
[sh-5-35.local:03853] mca: base: components_open: found loaded component vader
[sh-5-35.local:03853] mca: base: components_open: component vader open function successful
[sh-5-35.local:03853] select: initializing btl component self
[sh-5-35.local:03852] select: initializing btl component self
[sh-5-35.local:03852] select: init of component self returned success
[sh-5-35.local:03852] select: initializing btl component sm
[sh-5-35.local:03852] shmem: mmap: shmem_ds_resetting (id: 0, size:  0, name: )
[sh-5-35.local:03852] shmem: mmap: backing store base directory: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_pool.sh-5-35
[sh-5-35.local:03853] select: init of component self returned success
[sh-5-35.local:03853] select: initializing btl component sm
[sh-5-35.local:03853] select: init of component sm returned success
[sh-5-35.local:03853] select: initializing btl component vader
[sh-5-35.local:03853] shmem: mmap: shmem_ds_resetting (id: 0, size:  0, name: )
[sh-5-35.local:03853] shmem: mmap: backing store base directory: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/1/vader_segment.sh-5-35.1
[sh-5-35.local:03852] shmem: mmap: create successful (id: 13, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_pool.sh-5-35)
[sh-5-35.local:03853] shmem: mmap: create successful (id: 12, size: 4194312, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/1/vader_segment.sh-5-35.1)
[sh-5-35.local:03853] segment_attach: my_pid=3853 seg_cpid=3853
[sh-5-35.local:03853] segment_attach: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/1/vader_segment.sh-5-35.1
[sh-5-35.local:03853] shmem: mmap: attach successful (id: 12, size: 4194312, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/1/vader_segment.sh-5-35.1)
[sh-5-35.local:03853] select: init of component vader returned success
[sh-5-35.local:03852] segment_attach: my_pid=3852 seg_cpid=3852
[sh-5-35.local:03852] segment_attach: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_pool.sh-5-35
[sh-5-35.local:03852] shmem: mmap: attach successful (id: 13, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_pool.sh-5-35)
[sh-5-35.local:03852] shmem: mmap: ds_copy complete from: (id: 13, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_pool.sh-5-35 flags: 0x01) to: (id: 13, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_pool.sh-5-35 flags: 0x01)
[sh-5-35.local:03852] shmem: mmap: shmem_ds_resetting (id: 0, size:  0, name: )
[sh-5-35.local:03852] shmem: mmap: backing store base directory: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_btl_module.sh-5-35
[sh-5-35.local:03852] shmem: mmap: create successful (id: 13, size: 140, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_btl_module.sh-5-35)
[sh-5-35.local:03852] segment_attach: my_pid=3852 seg_cpid=3852
[sh-5-35.local:03852] segment_attach: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_btl_module.sh-5-35
[sh-5-35.local:03852] shmem: mmap: attach successful (id: 13, size: 140, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_btl_module.sh-5-35)
[sh-5-35.local:03852] shmem: mmap: ds_copy complete from: (id: 13, size: 140, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_btl_module.sh-5-35 flags: 0x01) to: (id: 13, size: 140, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_btl_module.sh-5-35 flags: 0x01)
[sh-5-35.local:03852] select: init of component sm returned success
[sh-5-35.local:03852] select: initializing btl component vader
[sh-5-35.local:03852] shmem: mmap: shmem_ds_resetting (id: 0, size:  0, name: )
[sh-5-35.local:03852] shmem: mmap: backing store base directory: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/0/vader_segment.sh-5-35.0
[sh-5-35.local:03852] shmem: mmap: create successful (id: 13, size: 4194312, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/0/vader_segment.sh-5-35.0)
[sh-5-35.local:03852] segment_attach: my_pid=3852 seg_cpid=3852
[sh-5-35.local:03852] segment_attach: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/0/vader_segment.sh-5-35.0
[sh-5-35.local:03852] shmem: mmap: attach successful (id: 13, size: 4194312, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/0/vader_segment.sh-5-35.0)
[sh-5-35.local:03852] select: init of component vader returned success
[sh-5-35.local:03852] mca_bml_r2_add_procs 0/3 calling add_procs with btl self
[sh-5-35.local:03852] mca: bml: Using self btl to [[21711,9],0] on node sh-5-35
[sh-5-35.local:03852] mca_bml_r2_add_procs 1/3 calling add_procs with btl vader
[sh-5-35.local:03852] vader_add_procs(nprocs=2)
[sh-5-35.local:03852] vader_add_procs: init_vader_endpoint(proc=0, local_rank=0) => 0
[sh-5-35.local:03852] init_vader_endpoint: modex recv success, seg_name=/tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/1/vader_segment.sh-5-35.1
[sh-5-35.local:03852] segment_attach: my_pid=3852 seg_cpid=3853
[sh-5-35.local:03852] segment_attach: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/1/vader_segment.sh-5-35.1
[sh-5-35.local:03852] shmem: mmap: attach successful (id: 13, size: 4194312, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/1/vader_segment.sh-5-35.1)
[sh-5-35.local:03852] vader_add_procs: init_vader_endpoint(proc=1, local_rank=1) => 0
[sh-5-35.local:03852] mca: bml: Using vader btl to [[21711,9],1] on node sh-5-35
[sh-5-35.local:03852] mca_bml_r2_add_procs 2/3 calling add_procs with btl sm
[sh-5-35.local:03852] segment_attach: my_pid=3852 seg_cpid=3852
[sh-5-35.local:03852] segment_attach: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_pool.sh-5-35
[sh-5-35.local:03852] shmem: mmap: attach successful (id: 13, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_pool.sh-5-35)
[sh-5-35.local:03852] shmem: mmap: ds_copy complete from: (id: 13, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_pool.sh-5-35 flags: 0x01) to: (id: 13, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_pool.sh-5-35 flags: 0x01)
[sh-5-35.local:03852] shmem: mmap: unlinking(id: 13, size: 140, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_btl_module.sh-5-35)
[sh-5-35.local:03852] shmem: mmap: unlinking(id: 13, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_pool.sh-5-35)
[sh-5-35.local:03853] mca_bml_r2_add_procs 0/3 calling add_procs with btl self
[sh-5-35.local:03853] mca: bml: Using self btl to [[21711,9],1] on node sh-5-35
[sh-5-35.local:03853] mca_bml_r2_add_procs 1/3 calling add_procs with btl vader
[sh-5-35.local:03853] vader_add_procs(nprocs=2)
[sh-5-35.local:03853] init_vader_endpoint: modex recv success, seg_name=/tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/0/vader_segment.sh-5-35.0
[sh-5-35.local:03853] segment_attach: my_pid=3853 seg_cpid=3852
[sh-5-35.local:03853] segment_attach: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/0/vader_segment.sh-5-35.0
[sh-5-35.local:03853] shmem: mmap: attach successful (id: 12, size: 4194312, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/0/vader_segment.sh-5-35.0)
[sh-5-35.local:03853] vader_add_procs: init_vader_endpoint(proc=0, local_rank=0) => 0
[sh-5-35.local:03853] vader_add_procs: init_vader_endpoint(proc=1, local_rank=1) => 0
[sh-5-35.local:03853] mca: bml: Using vader btl to [[21711,9],0] on node sh-5-35
[sh-5-35.local:03853] mca_bml_r2_add_procs 2/3 calling add_procs with btl sm
[sh-5-35.local:03853] segment_attach: my_pid=3853 seg_cpid=3852
[sh-5-35.local:03853] segment_attach: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_pool.sh-5-35
[sh-5-35.local:03853] shmem: mmap: attach successful (id: 12, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_pool.sh-5-35)
[sh-5-35.local:03853] shmem: mmap: ds_copy complete from: (id: 12, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_pool.sh-5-35 flags: 0x01) to: (id: 12, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_pool.sh-5-35 flags: 0x01)
[sh-5-35.local:03853] segment_attach: my_pid=3853 seg_cpid=3852
[sh-5-35.local:03853] segment_attach: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_btl_module.sh-5-35
[sh-5-35.local:03853] shmem: mmap: attach successful (id: 14, size: 140, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_btl_module.sh-5-35)
[sh-5-35.local:03853] shmem: mmap: ds_copy complete from: (id: 14, size: 140, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_btl_module.sh-5-35 flags: 0x01) to: (id: 14, size: 140, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_btl_module.sh-5-35 flags: 0x01)
[sh-5-35.local:03853] mca: bml: Not using sm btl to [[21711,9],0] on node sh-5-35 because vader btl has higher exclusivity (65536 > 65535)
[sh-5-35.local:03852] mca: bml: Not using sm btl to [[21711,9],1] on node sh-5-35 because vader btl has higher exclusivity (65536 > 65535)
[sh-5-35.local:03852] shmem: mmap: detaching (id: 13, size: 4194312, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/1/vader_segment.sh-5-35.1)
[sh-5-35.local:03852] shmem: mmap: shmem_ds_resetting (id: 13, size:  4194312, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/1/vader_segment.sh-5-35.1)
[sh-5-35.local:03852] shmem: mmap: unlinking(id: 13, size: 4194312, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/0/vader_segment.sh-5-35.0)
[sh-5-35.local:03852] shmem: mmap: detaching (id: -1, size: 4194312, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/0/vader_segment.sh-5-35.0)
[sh-5-35.local:03852] shmem: mmap: shmem_ds_resetting (id: -1, size:  4194312, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/0/vader_segment.sh-5-35.0)
[sh-5-35.local:03852] mca: base: close: component self closed
[sh-5-35.local:03852] mca: base: close: unloading component self
[sh-5-35.local:03852] shmem: mmap: detaching (id: -1, size: 140, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_btl_module.sh-5-35)
[sh-5-35.local:03852] shmem: mmap: shmem_ds_resetting (id: -1, size:  140, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_btl_module.sh-5-35)
[sh-5-35.local:03852] mca: base: close: component sm closed
[sh-5-35.local:03852] mca: base: close: unloading component sm
[sh-5-35.local:03852] mca: base: close: component vader closed
[sh-5-35.local:03852] mca: base: close: unloading component vader
[sh-5-35.local:03852] shmem: mmap: detaching (id: -1, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_pool.sh-5-35)
[sh-5-35.local:03852] shmem: mmap: shmem_ds_resetting (id: -1, size:  134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_pool.sh-5-35)
[sh-5-35.local:03853] shmem: mmap: detaching (id: 12, size: 4194312, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/0/vader_segment.sh-5-35.0)
[sh-5-35.local:03853] shmem: mmap: shmem_ds_resetting (id: 12, size:  4194312, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/0/vader_segment.sh-5-35.0)
[sh-5-35.local:03853] shmem: mmap: unlinking(id: 12, size: 4194312, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/1/vader_segment.sh-5-35.1)
[sh-5-35.local:03853] shmem: mmap: detaching (id: -1, size: 4194312, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/1/vader_segment.sh-5-35.1)
[sh-5-35.local:03853] shmem: mmap: shmem_ds_resetting (id: -1, size:  4194312, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/1/vader_segment.sh-5-35.1)
[sh-5-35.local:03853] mca: base: close: component self closed
[sh-5-35.local:03853] mca: base: close: unloading component self
[sh-5-35.local:03853] shmem: mmap: detaching (id: 14, size: 140, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_btl_module.sh-5-35)
[sh-5-35.local:03853] shmem: mmap: shmem_ds_resetting (id: 14, size:  140, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_btl_module.sh-5-35)
[sh-5-35.local:03853] mca: base: close: component sm closed
[sh-5-35.local:03853] mca: base: close: unloading component sm
[sh-5-35.local:03853] mca: base: close: component vader closed
[sh-5-35.local:03853] mca: base: close: unloading component vader
[sh-5-35.local:03853] shmem: mmap: detaching (id: 12, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_pool.sh-5-35)
[sh-5-35.local:03853] shmem: mmap: shmem_ds_resetting (id: 12, size:  134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/9/shared_mem_pool.sh-5-35)
[sh-5-35.local:03853] mca: base: close: component mmap closed
[sh-5-35.local:03852] mca: base: close: component mmap closed
[sh-5-35.local:03852] mca: base: close: unloading component mmap
[sh-5-35.local:03853] mca: base: close: unloading component mmap
Process 1 received number -1 from process 0

@ggouaillardet
Contributor

@kcgthb

And can you now run:

OMPI_MCA_btl_base_verbose=255 OMPI_MCA_shmem_base_verbose=255 srun -N 1 -n 2 ./send_recv

/* I "hope" a crash will occur ... */
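
For reference, here is what a minimal test along these lines could look like. This is a sketch only; the actual ./send_recv source was not posted in this thread, but the output above ("Process 1 received number -1 from process 0") matches the usual two-rank MPI_Send/MPI_Recv example, so something like the following should behave the same way:

/* send_recv.c: hypothetical minimal reproducer, two ranks on one node */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, number;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        number = -1;
        /* with the bug present, rank 0 segfaults in the vader send path here */
        MPI_Send(&number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 1 received number %d from process 0\n", number);
    }

    MPI_Finalize();
    return 0;
}

Compile with mpicc send_recv.c -o send_recv and launch it with the srun command above.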

@kcgthb
Author

kcgthb commented Oct 22, 2014

@ggouaillardet Yeah!

$ OMPI_MCA_btl_base_verbose=255 OMPI_MCA_shmem_base_verbose=255 srun -N 1 -n 2 ./send_recv
[sh-5-35.local:06353] mca: base: components_register: registering shmem components
[sh-5-35.local:06353] mca: base: components_register: found loaded component mmap
[sh-5-35.local:06353] mca: base: components_register: component mmap register function successful
[sh-5-35.local:06353] mca: base: components_register: found loaded component posix
[sh-5-35.local:06353] mca: base: components_register: component posix register function successful
[sh-5-35.local:06353] mca: base: components_register: found loaded component sysv
[sh-5-35.local:06353] mca: base: components_register: component sysv register function successful
[sh-5-35.local:06353] mca: base: components_open: opening shmem components
[sh-5-35.local:06353] mca: base: components_open: found loaded component mmap
[sh-5-35.local:06353] mca: base: components_open: component mmap open function successful
[sh-5-35.local:06353] mca: base: components_open: found loaded component posix
[sh-5-35.local:06353] mca: base: components_open: component posix open function successful
[sh-5-35.local:06353] mca: base: components_open: found loaded component sysv
[sh-5-35.local:06353] mca: base: components_open: component sysv open function successful
[sh-5-35.local:06353] shmem: base: runtime_query: Auto-selecting shmem components
[sh-5-35.local:06353] shmem: base: runtime_query: (shmem) Querying component (run-time) [mmap]
[sh-5-35.local:06353] shmem: base: runtime_query: (shmem) Query of component [mmap] set priority to 50
[sh-5-35.local:06353] shmem: base: runtime_query: (shmem) Querying component (run-time) [posix]
[sh-5-35.local:06353] shmem: base: runtime_query: (shmem) Query of component [posix] set priority to 40
[sh-5-35.local:06353] shmem: base: runtime_query: (shmem) Querying component (run-time) [sysv]
[sh-5-35.local:06354] mca: base: components_register: registering shmem components
[sh-5-35.local:06354] mca: base: components_register: found loaded component mmap
[sh-5-35.local:06354] mca: base: components_register: component mmap register function successful
[sh-5-35.local:06354] mca: base: components_register: found loaded component posix
[sh-5-35.local:06354] mca: base: components_register: component posix register function successful
[sh-5-35.local:06354] mca: base: components_register: found loaded component sysv
[sh-5-35.local:06354] mca: base: components_register: component sysv register function successful
[sh-5-35.local:06354] mca: base: components_open: opening shmem components
[sh-5-35.local:06354] mca: base: components_open: found loaded component mmap
[sh-5-35.local:06354] mca: base: components_open: component mmap open function successful
[sh-5-35.local:06354] mca: base: components_open: found loaded component posix
[sh-5-35.local:06354] mca: base: components_open: component posix open function successful
[sh-5-35.local:06354] mca: base: components_open: found loaded component sysv
[sh-5-35.local:06354] mca: base: components_open: component sysv open function successful
[sh-5-35.local:06354] shmem: base: runtime_query: Auto-selecting shmem components
[sh-5-35.local:06354] shmem: base: runtime_query: (shmem) Querying component (run-time) [mmap]
[sh-5-35.local:06354] shmem: base: runtime_query: (shmem) Query of component [mmap] set priority to 50
[sh-5-35.local:06354] shmem: base: runtime_query: (shmem) Querying component (run-time) [posix]
[sh-5-35.local:06354] shmem: base: runtime_query: (shmem) Query of component [posix] set priority to 40
[sh-5-35.local:06354] shmem: base: runtime_query: (shmem) Querying component (run-time) [sysv]
[sh-5-35.local:06353] shmem: base: runtime_query: (shmem) Query of component [sysv] set priority to 30
[sh-5-35.local:06353] shmem: base: runtime_query: (shmem) Selected component [mmap]
[sh-5-35.local:06353] mca: base: close: unloading component posix
[sh-5-35.local:06353] mca: base: close: unloading component sysv
[sh-5-35.local:06354] shmem: base: runtime_query: (shmem) Query of component [sysv] set priority to 30
[sh-5-35.local:06354] shmem: base: runtime_query: (shmem) Selected component [mmap]
[sh-5-35.local:06354] mca: base: close: unloading component posix
[sh-5-35.local:06354] mca: base: close: unloading component sysv
[sh-5-35.local:06353] mca: base: components_register: registering btl components
[sh-5-35.local:06353] mca: base: components_register: found loaded component openib
[sh-5-35.local:06354] mca: base: components_register: registering btl components
[sh-5-35.local:06354] mca: base: components_register: found loaded component openib
[sh-5-35.local:06353] mca: base: components_register: component openib register function successful
[sh-5-35.local:06354] mca: base: components_register: component openib register function successful
[sh-5-35.local:06354] mca: base: components_register: found loaded component self
[sh-5-35.local:06354] mca: base: components_register: component self register function successful
[sh-5-35.local:06353] mca: base: components_register: found loaded component self
[sh-5-35.local:06354] mca: base: components_register: found loaded component sm
[sh-5-35.local:06353] mca: base: components_register: component self register function successful
[sh-5-35.local:06354] mca: base: components_register: component sm register function successful
[sh-5-35.local:06354] mca: base: components_register: found loaded component tcp
[sh-5-35.local:06353] mca: base: components_register: found loaded component sm
[sh-5-35.local:06354] mca: base: components_register: component tcp register function successful
[sh-5-35.local:06354] mca: base: components_register: found loaded component usnic
[sh-5-35.local:06354] mca: base: components_register: component usnic register function successful
[sh-5-35.local:06353] mca: base: components_register: component sm register function successful
[sh-5-35.local:06354] mca: base: components_register: found loaded component vader
[sh-5-35.local:06353] mca: base: components_register: found loaded component tcp
[sh-5-35.local:06354] mca: base: components_register: component vader register function successful
[sh-5-35.local:06354] mca: base: components_open: opening btl components
[sh-5-35.local:06354] mca: base: components_open: found loaded component openib
[sh-5-35.local:06354] mca: base: components_open: component openib open function successful
[sh-5-35.local:06354] mca: base: components_open: found loaded component self
[sh-5-35.local:06354] mca: base: components_open: component self open function successful
[sh-5-35.local:06354] mca: base: components_open: found loaded component sm
[sh-5-35.local:06353] mca: base: components_register: component tcp register function successful
[sh-5-35.local:06354] mca: base: components_open: component sm open function successful
[sh-5-35.local:06353] mca: base: components_register: found loaded component usnic
[sh-5-35.local:06354] mca: base: components_open: found loaded component tcp
[sh-5-35.local:06354] mca: base: components_open: component tcp open function successful
[sh-5-35.local:06354] mca: base: components_open: found loaded component usnic
[sh-5-35.local:06354] mca: base: components_open: component usnic open function successful
[sh-5-35.local:06354] mca: base: components_open: found loaded component vader
[sh-5-35.local:06354] mca: base: components_open: component vader open function successful
[sh-5-35.local:06353] mca: base: components_register: component usnic register function successful
[sh-5-35.local:06353] mca: base: components_register: found loaded component vader
[sh-5-35.local:06353] mca: base: components_register: component vader register function successful
[sh-5-35.local:06353] mca: base: components_open: opening btl components
[sh-5-35.local:06353] mca: base: components_open: found loaded component openib
[sh-5-35.local:06353] mca: base: components_open: component openib open function successful
[sh-5-35.local:06353] mca: base: components_open: found loaded component self
[sh-5-35.local:06353] mca: base: components_open: component self open function successful
[sh-5-35.local:06353] mca: base: components_open: found loaded component sm
[sh-5-35.local:06353] mca: base: components_open: component sm open function successful
[sh-5-35.local:06353] mca: base: components_open: found loaded component tcp
[sh-5-35.local:06353] mca: base: components_open: component tcp open function successful
[sh-5-35.local:06353] mca: base: components_open: found loaded component usnic
[sh-5-35.local:06353] mca: base: components_open: component usnic open function successful
[sh-5-35.local:06353] mca: base: components_open: found loaded component vader
[sh-5-35.local:06353] mca: base: components_open: component vader open function successful
[sh-5-35.local:06354] select: initializing btl component openib
[sh-5-35.local:06353] select: initializing btl component openib
[sh-5-35.local:06353] openib BTL: rdmacm CPC available for use on mlx4_0:1
[sh-5-35.local:06354] openib BTL: rdmacm CPC available for use on mlx4_0:1
[sh-5-35.local:06353] [rank=0] openib: using port mlx4_0:1
[sh-5-35.local:06353] select: init of component openib returned success
[sh-5-35.local:06353] select: initializing btl component self
[sh-5-35.local:06353] select: init of component self returned success
[sh-5-35.local:06353] select: initializing btl component sm
[sh-5-35.local:06353] shmem: mmap: shmem_ds_resetting (id: 0, size:  0, name: )
[sh-5-35.local:06353] shmem: mmap: backing store base directory: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/shared_mem_pool.sh-5-35
[sh-5-35.local:06353] shmem: mmap: create successful (id: 23, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/shared_mem_pool.sh-5-35)
[sh-5-35.local:06353] segment_attach: my_pid=6353 seg_cpid=6353
[sh-5-35.local:06353] segment_attach: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/shared_mem_pool.sh-5-35
[sh-5-35.local:06353] shmem: mmap: attach successful (id: 23, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/shared_mem_pool.sh-5-35)
[sh-5-35.local:06353] shmem: mmap: ds_copy complete from: (id: 23, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/shared_mem_pool.sh-5-35 flags: 0x01) to: (id: 23, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/shared_mem_pool.sh-5-35 flags: 0x01)
[sh-5-35.local:06353] shmem: mmap: shmem_ds_resetting (id: 0, size:  0, name: )
[sh-5-35.local:06353] shmem: mmap: backing store base directory: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/shared_mem_btl_module.sh-5-35
[sh-5-35.local:06353] shmem: mmap: create successful (id: 23, size: 140, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/shared_mem_btl_module.sh-5-35)
[sh-5-35.local:06353] segment_attach: my_pid=6353 seg_cpid=6353
[sh-5-35.local:06353] segment_attach: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/shared_mem_btl_module.sh-5-35
[sh-5-35.local:06353] shmem: mmap: attach successful (id: 23, size: 140, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/shared_mem_btl_module.sh-5-35)
[sh-5-35.local:06353] shmem: mmap: ds_copy complete from: (id: 23, size: 140, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/shared_mem_btl_module.sh-5-35 flags: 0x01) to: (id: 23, size: 140, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/shared_mem_btl_module.sh-5-35 flags: 0x01)
[sh-5-35.local:06353] select: init of component sm returned success
[sh-5-35.local:06353] select: initializing btl component tcp
[sh-5-35.local:06354] [rank=1] openib: using port mlx4_0:1
[sh-5-35.local:06354] select: init of component openib returned success
[sh-5-35.local:06354] select: initializing btl component self
[sh-5-35.local:06354] select: init of component self returned success
[sh-5-35.local:06354] select: initializing btl component sm
[sh-5-35.local:06354] select: init of component sm returned success
[sh-5-35.local:06354] select: initializing btl component tcp
[sh-5-35.local:06354] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[sh-5-35.local:06354] btl: tcp: Found match: 127.0.0.1 (lo)
[sh-5-35.local:06354] select: init of component tcp returned success
[sh-5-35.local:06354] select: initializing btl component usnic
[sh-5-35.local:06354] found 1 verbs interface
[sh-5-35.local:06354] examining verbs interface: mlx4_0
[sh-5-35.local:06353] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[sh-5-35.local:06353] btl: tcp: Found match: 127.0.0.1 (lo)
[sh-5-35.local:06353] select: init of component tcp returned success
[sh-5-35.local:06353] select: initializing btl component usnic
[sh-5-35.local:06353] found 1 verbs interface
[sh-5-35.local:06353] examining verbs interface: mlx4_0
[sh-5-35.local:06354] found 1 verbs interface
[sh-5-35.local:06354] examining verbs interface: mlx4_0
[sh-5-35.local:06353] found 1 verbs interface
[sh-5-35.local:06353] examining verbs interface: mlx4_0
[sh-5-35.local:06353] btl:usnic: no usNICs found
[sh-5-35.local:06353] select: init of component usnic returned failure
[sh-5-35.local:06353] mca: base: close: component usnic closed
[sh-5-35.local:06353] mca: base: close: unloading component usnic
[sh-5-35.local:06354] btl:usnic: no usNICs found
[sh-5-35.local:06354] select: init of component usnic returned failure
[sh-5-35.local:06354] mca: base: close: component usnic closed
[sh-5-35.local:06354] mca: base: close: unloading component usnic
[sh-5-35.local:06354] select: initializing btl component vader
[sh-5-35.local:06354] shmem: mmap: shmem_ds_resetting (id: 0, size:  0, name: )
[sh-5-35.local:06354] shmem: mmap: backing store base directory: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/1/vader_segment.sh-5-35.1
[sh-5-35.local:06354] shmem: mmap: create successful (id: 24, size: 4194312, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/1/vader_segment.sh-5-35.1)
[sh-5-35.local:06354] segment_attach: my_pid=6354 seg_cpid=6354
[sh-5-35.local:06354] segment_attach: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/1/vader_segment.sh-5-35.1
[sh-5-35.local:06354] shmem: mmap: attach successful (id: 24, size: 4194312, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/1/vader_segment.sh-5-35.1)
[sh-5-35.local:06354] select: init of component vader returned success
[sh-5-35.local:06353] select: initializing btl component vader
[sh-5-35.local:06353] shmem: mmap: shmem_ds_resetting (id: 0, size:  0, name: )
[sh-5-35.local:06353] shmem: mmap: backing store base directory: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/0/vader_segment.sh-5-35.0
[sh-5-35.local:06353] shmem: mmap: create successful (id: 24, size: 4194312, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/0/vader_segment.sh-5-35.0)
[sh-5-35.local:06353] segment_attach: my_pid=6353 seg_cpid=6353
[sh-5-35.local:06353] segment_attach: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/0/vader_segment.sh-5-35.0
[sh-5-35.local:06353] shmem: mmap: attach successful (id: 24, size: 4194312, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/0/vader_segment.sh-5-35.0)
[sh-5-35.local:06353] select: init of component vader returned success
[sh-5-35.local:06353] mca_bml_r2_add_procs 0/6 calling add_procs with btl self
[sh-5-35.local:06354] mca_bml_r2_add_procs 0/6 calling add_procs with btl self
[sh-5-35.local:06353] mca: bml: Using self btl to [[21711,10],0] on node sh-5-35
[sh-5-35.local:06353] mca_bml_r2_add_procs 1/6 calling add_procs with btl vader
[sh-5-35.local:06353] vader_add_procs(nprocs=2)
[sh-5-35.local:06354] mca: bml: Using self btl to [[21711,10],1] on node sh-5-35
[sh-5-35.local:06354] mca_bml_r2_add_procs 1/6 calling add_procs with btl vader
[sh-5-35.local:06353] vader_add_procs: init_vader_endpoint(proc=0, local_rank=0) => 0
[sh-5-35.local:06354] vader_add_procs(nprocs=2)
[sh-5-35.local:06353] init_vader_endpoint: modex recv failed -48
[sh-5-35.local:06353] vader_add_procs: init_vader_endpoint(proc=1, local_rank=1) => -48
[sh-5-35.local:06353] mca: bml: Using vader btl to [[21711,10],1] on node sh-5-35
[sh-5-35.local:06353] mca_bml_r2_add_procs 2/6 calling add_procs with btl sm
[sh-5-35.local:06353] segment_attach: my_pid=6353 seg_cpid=6353
[sh-5-35.local:06353] segment_attach: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/shared_mem_pool.sh-5-35
[sh-5-35.local:06353] shmem: mmap: attach successful (id: 23, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/shared_mem_pool.sh-5-35)
[sh-5-35.local:06353] shmem: mmap: ds_copy complete from: (id: 23, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/shared_mem_pool.sh-5-35 flags: 0x01) to: (id: 23, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/shared_mem_pool.sh-5-35 flags: 0x01)
[sh-5-35.local:06354] init_vader_endpoint: modex recv failed -48
[sh-5-35.local:06354] vader_add_procs: init_vader_endpoint(proc=0, local_rank=0) => -48
[sh-5-35.local:06354] vader_add_procs: init_vader_endpoint(proc=1, local_rank=1) => 0
[sh-5-35.local:06354] mca: bml: Using vader btl to [[21711,10],0] on node sh-5-35
[sh-5-35.local:06354] mca_bml_r2_add_procs 2/6 calling add_procs with btl sm
[sh-5-35.local:06354] segment_attach: my_pid=6354 seg_cpid=6353
[sh-5-35.local:06354] segment_attach: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/shared_mem_pool.sh-5-35
[sh-5-35.local:06354] shmem: mmap: attach successful (id: 24, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/shared_mem_pool.sh-5-35)
[sh-5-35.local:06354] shmem: mmap: ds_copy complete from: (id: 24, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/shared_mem_pool.sh-5-35 flags: 0x01) to: (id: 24, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/shared_mem_pool.sh-5-35 flags: 0x01)
[sh-5-35.local:06354] segment_attach: my_pid=6354 seg_cpid=6353
[sh-5-35.local:06354] segment_attach: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/shared_mem_btl_module.sh-5-35
[sh-5-35.local:06354] shmem: mmap: attach successful (id: 29, size: 140, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/shared_mem_btl_module.sh-5-35)
[sh-5-35.local:06354] shmem: mmap: ds_copy complete from: (id: 29, size: 140, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/shared_mem_btl_module.sh-5-35 flags: 0x01) to: (id: 29, size: 140, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/shared_mem_btl_module.sh-5-35 flags: 0x01)
[sh-5-35.local:06353] shmem: mmap: unlinking(id: 23, size: 140, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/shared_mem_btl_module.sh-5-35)
[sh-5-35.local:06354] mca: bml: Not using sm btl to [[21711,10],0] on node sh-5-35 because vader btl has higher exclusivity (65536 > 65535)
[sh-5-35.local:06353] shmem: mmap: unlinking(id: 23, size: 134217736, name: /tmp/openmpi-sessions-kilian@sh-5-35_0/21711/10/shared_mem_pool.sh-5-35)
[sh-5-35.local:06354] mca_bml_r2_add_procs 3/6 calling add_procs with btl openib
[sh-5-35.local:06353] mca: bml: Not using sm btl to [[21711,10],1] on node sh-5-35 because vader btl has higher exclusivity (65536 > 65535)
[sh-5-35.local:06353] mca_bml_r2_add_procs 3/6 calling add_procs with btl openib
[sh-5-35.local:06353] mca: bml: Not using openib btl to [[21711,10],1] on node sh-5-35 because vader btl has higher exclusivity (65536 > 1024)
[sh-5-35.local:06353] mca_bml_r2_add_procs 4/6 calling add_procs with btl tcp
[sh-5-35.local:06354] mca: bml: Not using openib btl to [[21711,10],0] on node sh-5-35 because vader btl has higher exclusivity (65536 > 1024)
[sh-5-35.local:06354] mca_bml_r2_add_procs 4/6 calling add_procs with btl tcp
[sh-5-35.local:06354] mca: bml: Not using tcp btl to [[21711,10],0] on node sh-5-35 because vader btl has higher exclusivity (65536 > 100)
[sh-5-35.local:06354] mca_bml_r2_add_procs 5/6 calling add_procs with btl tcp
[sh-5-35.local:06354] mca: bml: Not using tcp btl to [[21711,10],0] on node sh-5-35 because vader btl has higher exclusivity (65536 > 100)
[sh-5-35.local:06353] mca: bml: Not using tcp btl to [[21711,10],1] on node sh-5-35 because vader btl has higher exclusivity (65536 > 100)
[sh-5-35.local:06353] mca_bml_r2_add_procs 5/6 calling add_procs with btl tcp
[sh-5-35.local:06353] mca: bml: Not using tcp btl to [[21711,10],1] on node sh-5-35 because vader btl has higher exclusivity (65536 > 100)
srun: error: sh-5-35: task 0: Segmentation fault

@ggouaillardet
Contributor

@kcgthb
[sh-5-35.local:06353] init_vader_endpoint: modex recv failed -48
This explains everything, but I do not know why the modex fails ...

Can you please run:

OMPI_MCA_btl_base_verbose=255 OMPI_MCA_shmem_base_verbose=255 OMPI_MCA_btl=^usnic srun -N 1 -n 2 ./send_recv

OMPI_MCA_btl_base_verbose=255 OMPI_MCA_shmem_base_verbose=255 OMPI_MCA_btl=^openib srun -N 1 -n 2 ./send_recv

OMPI_MCA_btl_base_verbose=255 OMPI_MCA_shmem_base_verbose=255 OMPI_MCA_btl=^tcp srun -N 1 -n 2 ./send_recv

and attach the logs only if a crash occurs.

@ggouaillardet
Contributor

@kcgthb
I was finally able to reproduce the issue:

  • the host has an IB card
  • the usnic btl was compiled in
  • no usNIC adapter is actually present on the host

In that configuration, running with
OMPI_MCA_btl=usnic,vader,self ...
is enough to reproduce the issue.

If you do not have any usNIC adapter, the simplest workaround is:
OMPI_MCA_btl=^usnic ...

I am now looking for the root cause of the issue.
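
To make that workaround persistent instead of exporting it per job, the same setting can go into an Open MPI MCA parameter file. This is a sketch assuming the standard parameter file locations; the exact paths depend on how this install was configured:

# per-user:   $HOME/.openmpi/mca-params.conf
# site-wide:  <openmpi-install-prefix>/etc/openmpi-mca-params.conf
btl = ^usnic

This has the same effect as setting OMPI_MCA_btl=^usnic in the environment before every srun or mpirun invocation.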

@ggouaillardet
Contributor

@kcgthb

This has been fixed on master by commit 7508c6f
and backported to v1.8 in pull request open-mpi/ompi-release#42.

If the usnic btl is used on a system where no usNIC is detected, a NULL modex object is sent.
That NULL object was not handled correctly on the receiving side: as a result, the modex object sent by the vader btl was never processed, and that ultimately caused the crash.

@hjelmn the vader btl is not the root cause of the crash, even if some errors were not correctly reported to the upper layer.
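
To illustrate the failure mode in isolation, here is a schematic C sketch. It is not the actual Open MPI code and not the fix in 7508c6f; all names and values are made up. The point is only that a receiver walking per-component modex entries has to reject a NULL or zero-length entry and report the error upward, otherwise a valid entry such as vader's segment name can go unprocessed and the later send path operates on an endpoint that was never set up:

/* modex_sketch.c: schematic illustration only, hypothetical names */
#include <stdio.h>
#include <stddef.h>

struct modex_entry { const char *component; const char *data; size_t size; };

/* Returns 0 on success, -1 for an empty entry; the caller must not ignore the error. */
static int init_endpoint(const struct modex_entry *e)
{
    if (e->data == NULL || e->size == 0) {
        return -1;  /* e.g. usnic published nothing because no usNIC adapter exists */
    }
    printf("endpoint ready: %s -> %s\n", e->component, e->data);
    return 0;
}

int main(void)
{
    struct modex_entry entries[] = {
        { "usnic", NULL, 0 },  /* empty entry from a host with no usNIC */
        { "vader", "/tmp/vader_segment.example",
          sizeof "/tmp/vader_segment.example" },  /* valid shared-memory segment name */
    };
    for (size_t i = 0; i < sizeof entries / sizeof entries[0]; i++) {
        if (init_endpoint(&entries[i]) != 0) {
            fprintf(stderr, "empty modex entry from %s, endpoint left unconnected\n",
                    entries[i].component);
        }
    }
    return 0;
}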

@kcgthb
Author

kcgthb commented Oct 22, 2014

@ggouaillardet I can confirm that patch 42 (commit 7508c6f) fixes the problem. No segfault anymore.
Thanks a lot!

rhc54 closed this as completed Oct 28, 2014
jsquyres pushed a commit to jsquyres/ompi that referenced this issue Sep 21, 2016