
OMPI/OSC/UCX: fix issue in impl of MPI_Win_create_dynamic/MPI_Win_attach/MPI_Win_detach #5094

Merged: jladd-mlnx merged 1 commit into open-mpi:master on May 2, 2018

Conversation

@xinzhao3 (Contributor)

The original implementation of MPI_Win_create_dynamic is not correct. This patch redoes the implementation.

Signed-off-by: Xin Zhao <xinz@mellanox.com>

OMPI/OSC/UCX: fix issue in impl of MPI_Win_create_dynamic/MPI_Win_attach/MPI_Win_detach

Signed-off-by: Xin Zhao <xinz@mellanox.com>
@xinzhao3 (Contributor, Author) commented Apr 24, 2018

Test code: aint.c (from the MPICH test suite):

#include <stdio.h>
#include <mpi.h>
int main(int argc, char **argv)
{
    int rank, nproc;
    int errs = 0;
    int array[1024];
    int val = 0;
    int target_rank;
    MPI_Aint bases[2];
    MPI_Aint disp, offset;
    MPI_Win win;

    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    if (rank == 0 && nproc != 2) {
        printf("Must run with 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1); /* ranks other than 0/1 never set target_rank */
    }

    /* Get the base address in the middle of the array */
    if (rank == 0) {
        target_rank = 1;
        array[0] = 1234;
        MPI_Get_address(&array[512], &bases[0]);
    } else if (rank == 1) {
        target_rank = 0;
        array[1023] = 1234;
        MPI_Get_address(&array[512], &bases[1]);
    }

    /* Exchange bases */
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL, bases, 1, MPI_AINT, MPI_COMM_WORLD);

    MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_attach(win, array, sizeof(int) * 1024);

    /* Do MPI_Aint addressing arithmetic */
    if (rank == 0) {
        disp = sizeof(int) * 511;
        offset = MPI_Aint_add(bases[1], disp);  /* offset points to array[1023] */
    } else if (rank == 1) {
        disp = sizeof(int) * 512;
        offset = MPI_Aint_diff(bases[0], disp); /* offset points to array[0] */
    }

    /* Get val and verify it */
    MPI_Win_fence(MPI_MODE_NOPRECEDE, win);
    MPI_Get(&val, 1, MPI_INT, target_rank, offset, 1, MPI_INT, win);
    MPI_Win_fence(MPI_MODE_NOSUCCEED, win);

    if (val != 1234) {
        errs++;
        printf("%d -- Got %d, expected 1234\n", rank, val);
    }

    MPI_Win_detach(win, array);

    MPI_Win_free(&win);

    MPI_Finalize();
    return 0;
}
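
(For reference, an assumption not stated in the thread: the test builds with standard MPI tooling, e.g. mpicc aint.c -o aint, and per the rank check above must run with exactly two ranks, e.g. mpirun -np 2 ./aint.)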

@xinzhao3 (Contributor, Author)

Fixes the issue related to #5083.

@xinzhao3 (Contributor, Author)

@gpaulsen could you review this PR? It is based on ompi/master; after the review is done, I will create a PR for ompi/v3.1.x.

@jladd-mlnx jladd-mlnx added the bug label Apr 24, 2018
@jladd-mlnx jladd-mlnx requested a review from gpaulsen April 24, 2018 20:18
@jladd-mlnx jladd-mlnx added this to the v3.1.0 milestone Apr 24, 2018
@@ -643,11 +670,116 @@ static int component_select(struct ompi_win_t *win, void **base, size_t size, in
return ret;
}

int ompi_osc_find_attached_region_position(ompi_osc_dynamic_win_info_t *dynamic_wins,
                                           int min_index, int max_index,
A reviewer (Member) commented on this hunk:

You might want to make this ompi_osc_ucx_find_attached_region_position
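
For context on what this helper does: given the min_index/max_index arguments in the hunk above, it presumably binary-searches the window's list of attached regions, kept sorted by base address, to find the region containing a target address. A minimal sketch of that kind of lookup, with an illustrative struct standing in for ompi_osc_dynamic_win_info_t (the field names, bounds handling, and helper name here are assumptions, not the actual osc/ucx code):

#include <stdint.h>
#include <stddef.h>

/* Illustrative stand-in for ompi_osc_dynamic_win_info_t: one region
 * attached with MPI_Win_attach, kept in an array sorted by base address. */
typedef struct {
    uint64_t base;   /* start address of the attached region */
    size_t   size;   /* length of the region in bytes */
} dyn_win_info_t;

/* Binary-search wins[min_index..max_index] for the region containing
 * [addr, addr + len).  Returns the matching index, or -1 if none does,
 * in which case *insert is the slot that keeps the array sorted. */
static int find_attached_region_position(const dyn_win_info_t *wins,
                                         int min_index, int max_index,
                                         uint64_t addr, size_t len, int *insert)
{
    if (max_index < min_index) {
        *insert = min_index;
        return -1;
    }

    int mid = min_index + (max_index - min_index) / 2;

    if (addr < wins[mid].base) {
        return find_attached_region_position(wins, min_index, mid - 1,
                                             addr, len, insert);
    } else if (addr + len <= wins[mid].base + wins[mid].size) {
        return mid;   /* the target interval falls inside this region */
    } else {
        return find_attached_region_position(wins, mid + 1, max_index,
                                             addr, len, insert);
    }
}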

@hjelmn (Member) commented Apr 25, 2018

From what I can tell this will work, but it will be incredibly inefficient. Yet another reason why I recommend a btl. To make this better you will have to re-implement the optimizations already present in osc/rdma :(.

@hjelmn (Member) commented Apr 25, 2018

That said, this fixes a bug and is sufficient for v3.0.x and v3.1.x.

@gpaulsen (Member)

I'm working to build and test this now.

@angainor

@xinzhao3 Have you tried it with this test code? It still segfaults for me with 3.1.0rc5 + 5094.diff:

$ mpirun -mca osc ucx -mca osc_base_verbose 100 -np 2 ./a.out 
[1525076961.480723] [login-0-0:17913:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[1525076961.483837] [login-0-0:17912:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[1525076961.484277] [login-0-0:17913:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[1525076961.488155] [login-0-0:17912:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[login-0-0.local:17913] mca: base: components_register: registering framework osc components
[login-0-0.local:17913] mca: base: components_register: found loaded component ucx
[login-0-0.local:17913] mca: base: components_register: component ucx register function successful
[login-0-0.local:17913] mca: base: components_open: opening osc components
[login-0-0.local:17913] mca: base: components_open: found loaded component ucx
[login-0-0.local:17913] mca: base: components_open: component ucx open function successful
[login-0-0.local:17912] mca: base: components_register: registering framework osc components
[login-0-0.local:17912] mca: base: components_register: found loaded component ucx
[login-0-0.local:17912] mca: base: components_register: component ucx register function successful
[login-0-0.local:17912] mca: base: components_open: opening osc components
[login-0-0.local:17912] mca: base: components_open: found loaded component ucx
[login-0-0.local:17912] mca: base: components_open: component ucx open function successful
[login-0-0.local:17913] osc_ucx_component.c:467: ucp_ep_create failed: -6
[login-0-0.local:17912] osc_ucx_component.c:467: ucp_ep_create failed: -6
[login-0-0:17913:0] Caught signal 11 (Segmentation fault)
[login-0-0:17912:0] Caught signal 11 (Segmentation fault)
[1525076961.917268] [login-0-0:17912:0]         select.c:312  UCX  ERROR no atomic operations on registered memory transport to <no debug data>: Unsupported operation
[1525076961.917269] [login-0-0:17913:0]         select.c:312  UCX  ERROR no atomic operations on registered memory transport to <no debug data>: Unsupported operation
==== backtrace ====
==== backtrace ====
 2 0x000000000005a41c mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.1.0-gcc-MLNX_OFED_LINUX-4.2-1.2.0.0-redhat6.9-x86_64/mxm-v3.7/src/mxm/util/debug/debug.c:641
 3 0x000000000005a58c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.1.0-gcc-MLNX_OFED_LINUX-4.2-1.2.0.0-redhat6.9-x86_64/mxm-v3.7/src/mxm/util/debug/debug.c:616
 4 0x000000364e632510 killpg()  ??:0
 5 0x0000000000008c8f component_select()  osc_ucx_component.c:0
 6 0x0000000000054811 ompi_win_create_dynamic()  ??:0
 7 0x000000000007a8ea MPI_Win_create_dynamic()  ??:0
 8 0x0000000000400c02 main()  /usit/abel/u1/marcink/mpitest/win2.c:37
 9 0x000000364e61ed1d __libc_start_main()  ??:0
10 0x0000000000400d31 _start()  ??:0
 2 0x000000000005a41c mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.1.0-gcc-MLNX_OFED_LINUX-4.2-1.2.0.0-redhat6.9-x86_64/mxm-v3.7/src/mxm/util/debug/debug.c:641
 3 0x000000000005a58c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.1.0-gcc-MLNX_OFED_LINUX-4.2-1.2.0.0-redhat6.9-x86_64/mxm-v3.7/src/mxm/util/debug/debug.c:616
 4 0x000000364e632510 killpg()  ??:0
 5 0x0000000000008c8f component_select()  osc_ucx_component.c:0
 6 0x0000000000054811 ompi_win_create_dynamic()  ??:0
 7 0x000000000007a8ea MPI_Win_create_dynamic()  ??:0
 8 0x0000000000400c02 main()  /usit/abel/u1/marcink/mpitest/win2.c:37
 9 0x000000364e61ed1d __libc_start_main()  ??:0
10 0x0000000000400d31 _start()  ??:0
===================
===================
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node login-0-0 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

@gpaulsen (Member) left a review:

I don't see the segfault on ppc64le, and all of my previous osc/ucx failures now pass on master.

@angainor

Hm. I'll try again tomorrow on a clean installation.

@angainor commented May 1, 2018

@xinzhao3 @gpaulsen @jladd-mlnx I confirm that I still get a segfault with the patch on our FDR ConnectX-3 system; I do not get a segfault on our ConnectX-4 installation. Compared with the original report #5083, the crash is different. In particular, it is now in component_select:

 5 0x0000000000008c8f component_select()  osc_ucx_component.c:0
[...]
[1525076961.917268] [login-0-0:17912:0]         select.c:312  UCX  ERROR no atomic operations on registered memory transport to <no debug data>: Unsupported operation

@xinzhao3 (Contributor, Author) commented May 1, 2018

@angainor which test code is it?

@angainor commented May 1, 2018

@xinzhao3 The one you pasted here (the same one I used in the original post).

@gpaulsen (Member) commented May 1, 2018

I cherry-picked the fix to the v3.1.x branch and retested; all looks good.
I'm running on MT27700 ConnectX-4 ppc64le nodes with RHEL 7.3 and MOFED 4.3.1.5.5.

@gpaulsen (Member) commented May 1, 2018

@angainor, @xinzhao3, @bwbarrett - In our Open MPI Web-Ex today, we decided to turn the priority of this (OSC UCX) component down to 0 by default for the v3.1.x release that we're trying to get out the door. @xinzhao3 will make that PR separately.
I'd hope that we can get this PR in soon to at least address ConnectX-4 and newer for the v3.1.0 release (again, off by default due to priority 0), and then work to address older HCAs for v3.1.1. @hjelmn, your thoughts?

@bwbarrett (Member)

Do we have a root cause for why this produces a different failure on older hardware? Even with the query value set to 0, there's an awful lot of code still being executed on this path, some of which might be the buggy code causing the segfault. Without understanding that segfault, I think we need to consider not shipping the component in 3.1.0.

@angainor commented May 2, 2018

@xinzhao3 @jladd-mlnx Can you reproduce my problems? Or might this be my system's configuration of one-sided communication rather than a general ConnectX-3 issue? At least it seems that this is related to UCX itself and not to this particular PR. In the tests below I removed the 5094 patch, running on ConnectX-3 with the hpcx-2.1 release, so at least UCX is up to date.

Here I'm trying to run a simple OpenSHMEM program with the ucx spml, and I get the same error:

$ shmemrun -np 2 -map-by node -mca spml ucx ./a.out
[compute-9-5.local:22635] Error spml_ucx.c:293 - mca_spml_ucx_add_procs() ucp_ep_create failed: Destination is unreachable
[compute-9-5:22635] *** Process received signal ***
[compute-9-5.local:22636] Error spml_ucx.c:293 - mca_spml_ucx_add_procs() ucp_ep_create failed: Destination is unreachable
[compute-9-5:22636] *** Process received signal ***
[compute-9-5:22636] Signal: Aborted (6)
[compute-9-5:22636] Signal code:  (-6)
[compute-9-5:22635] Signal: Aborted (6)
[compute-9-5:22635] Signal code:  (-6)
[1525247684.579983] [compute-9-5:22635:0]         select.c:312  UCX  ERROR no atomic operations on registered memory transport to <no debug data>: Unsupported operation
[1525247684.579983] [compute-9-5:22636:0]         select.c:312  UCX  ERROR no atomic operations on registered memory transport to <no debug data>: Unsupported operation

If I change the transport from the default rc to UCX_TLS=ud,self, I get a different error (Destination is unreachable, this time with an OSU benchmark):

$ mpirun -mca osc ucx -mca pml ucx -np 2 -map-by node -x UCX_TLS=ud,self one-sided/osu_get_bw 
# OSU MPI_Get Bandwidth Test v5.3.2
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_flush
# Size      Bandwidth (MB/s)
[compute-9-5:25512:0] Caught signal 11 (Segmentation fault)
[compute-9-21:1580 :0] Caught signal 11 (Segmentation fault)
[1525249603.653381] [compute-9-21:1580 :0]         select.c:312  UCX  ERROR no remote registered memory access transport to <no debug data>: rdmacm/sockaddr - no pending, ud/mlx4_0:1 - no put short, self/self - Destination is unreachable
[1525249603.650098] [compute-9-5:25512:0]         select.c:312  UCX  ERROR no remote registered memory access transport to <no debug data>: rdmacm/sockaddr - no pending, ud/mlx4_0:1 - no put short, self/self - Destination is unreachable

I'm compiling Open MPI 3.0.1 now to double-check.

@angainor commented May 2, 2018

@xinzhao3 @jladd-mlnx Well, the same thing happens with a home-compiled 3.0.1 and with the precompiled version shipped with hpcx 2.1.

The OpenSHMEM code starts fine if I use -mca spml ikrit. Any ideas?

@jladd-mlnx (Member)

@angainor what MOFED version are you running on your CX-3 system? It looks like HCA-based atomics aren't working there.

@jladd-mlnx (Member)

The official statement is that UCX requires extended atomics to support OpenSHMEM semantics, which are not present on the ConnectX-3 device.

@jsquyres (Member) commented May 2, 2018

@jladd-mlnx Fair enough; no support on older devices is a perfectly valid vendor decision. But it still probably shouldn't abort in the presence of an older device.

@bwbarrett (Member)

How did we switch from MPI one-sided to shmem in this pull request discussion? Let's stick to the one-sided issue; if there are shmem issues, please file a bug report and we'll track those.

Back to the one-sided fix: @jladd-mlnx and @xinzhao3, how comfortable are you that the crash on CX-3 won't be a problem if we set the priority to 0 instead of disabling the component entirely? I'd like to close on that point before we merge the PR.

@bwbarrett bwbarrett removed this from the v3.1.0 milestone May 2, 2018
@angainor commented May 2, 2018

@bwbarrett @jladd-mlnx Sorry for mentioning OpenSHMEM; that was only to show that the error I'm getting comes from UCX and is not Open MPI specific. But @jsquyres is right that the code in this example shouldn't segfault when executed without any osc options, yet it does: mpirun -np 2 -mca osc_base_verbose 100 ./a.out with 3.1.0 segfaults during component selection. And the same happens with OpenSHMEM.

$ mpirun -np 2 -mca osc_base_verbose 100 ./a.out
[compute-2-1.local:29028] mca: base: components_register: registering framework osc components
[compute-2-1.local:29028] mca: base: components_register: found loaded component monitoring
[compute-2-1.local:29028] mca: base: components_register: component monitoring register function successful
[compute-2-1.local:29027] mca: base: components_register: registering framework osc components
[compute-2-1.local:29027] mca: base: components_register: found loaded component monitoring
[compute-2-1.local:29027] mca: base: components_register: component monitoring register function successful
[compute-2-1.local:29028] mca: base: components_register: found loaded component sm
[compute-2-1.local:29028] mca: base: components_register: component sm has no register or open function
[compute-2-1.local:29027] mca: base: components_register: found loaded component sm
[compute-2-1.local:29027] mca: base: components_register: component sm has no register or open function
[compute-2-1.local:29028] mca: base: components_register: found loaded component pt2pt
[compute-2-1.local:29027] mca: base: components_register: found loaded component pt2pt
[compute-2-1.local:29028] mca: base: components_register: component pt2pt register function successful
[compute-2-1.local:29027] mca: base: components_register: component pt2pt register function successful
[compute-2-1.local:29028] mca: base: components_register: found loaded component ucx
[compute-2-1.local:29028] mca: base: components_register: component ucx register function successful
[compute-2-1.local:29027] mca: base: components_register: found loaded component ucx
[compute-2-1.local:29027] mca: base: components_register: component ucx register function successful
[compute-2-1.local:29028] mca: base: components_register: found loaded component rdma
[compute-2-1.local:29027] mca: base: components_register: found loaded component rdma
[compute-2-1.local:29028] mca: base: components_register: component rdma register function successful
[compute-2-1.local:29027] mca: base: components_register: component rdma register function successful
[compute-2-1.local:29028] mca: base: components_open: opening osc components
[compute-2-1.local:29028] mca: base: components_open: found loaded component monitoring
[compute-2-1.local:29027] mca: base: components_open: opening osc components
[compute-2-1.local:29027] mca: base: components_open: found loaded component monitoring
[compute-2-1.local:29027] mca: base: components_open: found loaded component sm
[compute-2-1.local:29027] mca: base: components_open: component sm open function successful
[compute-2-1.local:29027] mca: base: components_open: found loaded component pt2pt
[compute-2-1.local:29028] mca: base: components_open: found loaded component sm
[compute-2-1.local:29028] mca: base: components_open: component sm open function successful
[compute-2-1.local:29028] mca: base: components_open: found loaded component pt2pt
[compute-2-1.local:29028] mca: base: components_open: found loaded component ucx
[compute-2-1.local:29028] mca: base: components_open: component ucx open function successful
[compute-2-1.local:29028] mca: base: components_open: found loaded component rdma
[compute-2-1.local:29027] mca: base: components_open: found loaded component ucx
[compute-2-1.local:29027] mca: base: components_open: component ucx open function successful
[compute-2-1.local:29027] mca: base: components_open: found loaded component rdma
[compute-2-1.local:29028] mca: base: close: unloading component monitoring
[compute-2-1.local:29027] mca: base: close: unloading component monitoring
[compute-2-1.local:29027] osc_ucx_component.c:467: ucp_ep_create failed: -6
[compute-2-1.local:29028] osc_ucx_component.c:467: ucp_ep_create failed: -6
[compute-2-1:29028:0] Caught signal 11 (Segmentation fault)
[compute-2-1:29027:0] Caught signal 11 (Segmentation fault)
[1525278335.604818] [compute-2-1:29027:0]         select.c:312  UCX  ERROR no atomic operations on registered memory transport to <no debug data>: Unsupported operation
[1525278335.604819] [compute-2-1:29028:0]         select.c:312  UCX  ERROR no atomic operations on registered memory transport to <no debug data>: Unsupported operation
==== backtrace ====
 2 0x000000000005a41c mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.1.0-gcc-MLNX_OFED_LINUX-4.2-1.2.0.0-redhat6.9-x86_64/mxm-v3.7/src/mxm/util/debug/debug.c:641
 3 0x000000000005a58c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.1.0-gcc-MLNX_OFED_LINUX-4.2-1.2.0.0-redhat6.9-x86_64/mxm-v3.7/src/mxm/util/debug/debug.c:616
 4 0x0000000000032510 killpg()  ??:0
 5 0x0000000000008c8f component_select()  osc_ucx_component.c:0
 6 0x0000000000054323 ompi_win_create()  ??:0
 7 0x000000000007a851 MPI_Win_create()  ??:0
 8 0x0000000000400a64 main()  ??:0
 9 0x000000000001ed1d __libc_start_main()  ??:0
10 0x00000000004008d9 _start()  ??:0

@xinzhao3 (Contributor, Author) commented May 2, 2018

@bwbarrett we are 100% certain that the crash on CX-3 will not be hit if the priority is set to 0.

@angainor commented May 2, 2018

@xinzhao3 I really hate to say this, but something is still not right. Most of the time the above test code finishes, but once in a while (1 in 5?) I still get a segfault running the default (mpirun -np 2 -mca osc_base_verbose 100 ./a.out):

$ mpirun -np 2 -mca osc_base_verbose 100  ./a.out
[compute-2-1.local:08327] mca: base: components_register: registering framework osc components
[compute-2-1.local:08327] mca: base: components_register: found loaded component monitoring
[compute-2-1.local:08327] mca: base: components_register: component monitoring register function successful
[compute-2-1.local:08328] mca: base: components_register: registering framework osc components
[compute-2-1.local:08328] mca: base: components_register: found loaded component monitoring
[compute-2-1.local:08328] mca: base: components_register: component monitoring register function successful
[compute-2-1.local:08327] mca: base: components_register: found loaded component sm
[compute-2-1.local:08327] mca: base: components_register: component sm has no register or open function
[compute-2-1.local:08327] mca: base: components_register: found loaded component pt2pt
[compute-2-1.local:08328] mca: base: components_register: found loaded component sm
[compute-2-1.local:08328] mca: base: components_register: component sm has no register or open function
[compute-2-1.local:08328] mca: base: components_register: found loaded component pt2pt
[compute-2-1.local:08327] mca: base: components_register: component pt2pt register function successful
[compute-2-1.local:08327] mca: base: components_register: found loaded component ucx
[compute-2-1.local:08327] mca: base: components_register: component ucx register function successful
[compute-2-1.local:08328] mca: base: components_register: component pt2pt register function successful
[compute-2-1.local:08327] mca: base: components_register: found loaded component rdma
[compute-2-1.local:08328] mca: base: components_register: found loaded component ucx
[compute-2-1.local:08328] mca: base: components_register: component ucx register function successful
[compute-2-1.local:08328] mca: base: components_register: found loaded component rdma
[compute-2-1.local:08327] mca: base: components_register: component rdma register function successful
[compute-2-1.local:08327] mca: base: components_open: opening osc components
[compute-2-1.local:08327] mca: base: components_open: found loaded component monitoring
[compute-2-1.local:08327] mca: base: components_open: found loaded component sm
[compute-2-1.local:08327] mca: base: components_open: component sm open function successful
[compute-2-1.local:08327] mca: base: components_open: found loaded component pt2pt
[compute-2-1.local:08327] mca: base: components_open: found loaded component ucx
[compute-2-1.local:08327] mca: base: components_open: component ucx open function successful
[compute-2-1.local:08327] mca: base: components_open: found loaded component rdma
[compute-2-1.local:08328] mca: base: components_register: component rdma register function successful
[compute-2-1.local:08328] mca: base: components_open: opening osc components
[compute-2-1.local:08328] mca: base: components_open: found loaded component monitoring
[compute-2-1.local:08328] mca: base: components_open: found loaded component sm
[compute-2-1.local:08328] mca: base: components_open: component sm open function successful
[compute-2-1.local:08328] mca: base: components_open: found loaded component pt2pt
[compute-2-1.local:08328] mca: base: components_open: found loaded component ucx
[compute-2-1.local:08328] mca: base: components_open: component ucx open function successful
[compute-2-1.local:08328] mca: base: components_open: found loaded component rdma
[compute-2-1.local:08328] mca: base: close: unloading component monitoring
[compute-2-1.local:08327] mca: base: close: unloading component monitoring
[compute-2-1:8327 :0] Caught signal 11 (Segmentation fault)
==== backtrace ====
 2 0x000000000005a41c mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.1.0-gcc-MLNX_OFED_LINUX-4.2-1.2.0.0-redhat6.9-x86_64/mxm-v3.7/src/mxm/util/debug/debug.c:641
 3 0x000000000005a58c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.1.0-gcc-MLNX_OFED_LINUX-4.2-1.2.0.0-redhat6.9-x86_64/mxm-v3.7/src/mxm/util/debug/debug.c:616
 4 0x0000000000032510 killpg()  ??:0
 5 0x0000000000129138 __strcmp_sse42()  thread-freeres.c:0
 6 0x0000000000019aa5 ompi_osc_rdma_component_query()  osc_rdma_component.c:0
 7 0x00000000000ace33 ompi_osc_base_select()  ??:0
 8 0x0000000000054811 ompi_win_create_dynamic()  ??:0
 9 0x000000000007a8ea MPI_Win_create_dynamic()  ??:0
10 0x0000000000400c02 main()  /usit/abel/u1/marcink/mpitest/win_bugreport.c:37
11 0x000000000001ed1d __libc_start_main()  ??:0
12 0x0000000000400d31 _start()  ??:0

@jsquyres (Member) commented May 2, 2018

@angainor Can you also try with #5134 and verify that with the priority 0 you never get a segv/abort?

@angainor commented May 2, 2018

@jsquyres The new segfault is with both the #5094 and #5134 patches. Clearly, with the priority change rdma is chosen (you can see this in the backtrace), but it still segfaults sometimes.

@hjelmn (Member) commented May 2, 2018

Very odd. Can you rebuild with --enable-debug? That might give us a more useful backtrace.

@hjelmn (Member) commented May 2, 2018

Before that, can you run with --mca osc ^monitoring, just to see what happens? Then try --mca osc pt2pt,rdma,sm.

@hjelmn (Member) commented May 2, 2018

What might also be interesting to try is --mca pml ob1 just to be sure this isn't something caused by UCX usage there. I doubt it, but we need to narrow this down if possible.

@angainor commented May 2, 2018

@hjelmn It was enough to run mpirun -np 2 -mca osc_base_verbose 100 -mca osc rdma,sm ./a.out:

[compute-2-1.local:11434] mca: base: components_register: registering framework osc components
[compute-2-1.local:11434] mca: base: components_register: found loaded component sm
[compute-2-1.local:11434] mca: base: components_register: component sm has no register or open function
[compute-2-1.local:11433] mca: base: components_register: registering framework osc components
[compute-2-1.local:11433] mca: base: components_register: found loaded component sm
[compute-2-1.local:11433] mca: base: components_register: component sm has no register or open function
[compute-2-1.local:11434] mca: base: components_register: found loaded component rdma
[compute-2-1.local:11433] mca: base: components_register: found loaded component rdma
[compute-2-1.local:11434] mca: base: components_register: component rdma register function successful
[compute-2-1.local:11433] mca: base: components_register: component rdma register function successful
[compute-2-1.local:11434] mca: base: components_open: opening osc components
[compute-2-1.local:11434] mca: base: components_open: found loaded component sm
[compute-2-1.local:11433] mca: base: components_open: opening osc components
[compute-2-1.local:11433] mca: base: components_open: found loaded component sm
[compute-2-1.local:11433] mca: base: components_open: component sm open function successful
[compute-2-1.local:11433] mca: base: components_open: found loaded component rdma
[compute-2-1.local:11434] mca: base: components_open: component sm open function successful
[compute-2-1.local:11434] mca: base: components_open: found loaded component rdma
[compute-2-1:11434:0] Caught signal 11 (Segmentation fault)
==== backtrace ====
 2 0x000000000005a41c mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.1.0-gcc-MLNX_OFED_LINUX-4.2-1.2.0.0-redhat6.9-x86_64/mxm-v3.7/src/mxm/util/debug/debug.c:641
 3 0x000000000005a58c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.1.0-gcc-MLNX_OFED_LINUX-4.2-1.2.0.0-redhat6.9-x86_64/mxm-v3.7/src/mxm/util/debug/debug.c:616
 4 0x0000000000032510 killpg()  ??:0
 5 0x0000000000129138 __strcmp_sse42()  thread-freeres.c:0
 6 0x0000000000019aa5 ompi_osc_rdma_component_query()  osc_rdma_component.c:0
 7 0x00000000000ace33 ompi_osc_base_select()  ??:0
 8 0x0000000000054811 ompi_win_create_dynamic()  ??:0
 9 0x000000000007a8ea MPI_Win_create_dynamic()  ??:0
10 0x0000000000400c02 main()  /usit/abel/u1/marcink/mpitest/win_bugreport.c:37
11 0x000000000001ed1d __libc_start_main()  ??:0
12 0x0000000000400d31 _start()  ??:0
===================

Same segfault as above, just without any mention of ucx. I guess it is a different issue altogether.

@angainor commented May 2, 2018

@hjelmn This time I configured with:

$ ./configure --prefix=/usit/abel/u1/marcink/software/openmpi.gnu/3.1.0rc5 --with-knem=/usit/abel/u1/marcink/software/hpcx/2.1.0/knem --with-mxm=/usit/abel/u1/marcink/software/hpcx/2.1.0/mxm --with-hcoll=/usit/abel/u1/marcink/software/hpcx/2.1.0/hcoll --with-ucx=/usit/abel/u1/marcink/software/hpcx/2.1.0/ucx --with-platform=contrib/platform/mellanox/optimized --enable-debug

But --enable-debug didn't seem to do much. Edit: the configure summary said Debug build: No.

@hjelmn (Member) commented May 2, 2018

Interesting. All the line numbers are garbage. Is there a core file you can look at with gdb?

@angainor commented May 2, 2018

@hjelmn Maybe I should first get a debug build; that one clearly didn't work. Doesn't contrib/platform/mellanox/optimized turn off the debugging somehow? What's this mellanox_debug?

@hjelmn (Member) commented May 2, 2018

I can't remember the priority; I thought --enable-debug would override any enable_debug=no in a platform file. I would use the debug platform file and see if that gives an actual debug build. You can check by running ompi_info after it is installed; that will indicate whether a build has --enable-debug or not. configure also prints a line at the end when debug is enabled.

@jladd-mlnx (Member)

@angainor get rid of the --with-platform=contrib/platform/mellanox/optimized flag and keep --enable-debug

@angainor commented May 2, 2018

@hjelmn This is the debug stack:

[compute-2-1:15200:0] Caught signal 11 (Segmentation fault)
==== backtrace ====
 2 0x000000000005a41c mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.1.0-gcc-MLNX_OFED_LINUX-4.2-1.2.0.0-redhat6.9-x86_64/mxm-v3.7/src/mxm/util/debug/debug.c:641
 3 0x000000000005a58c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.1.0-gcc-MLNX_OFED_LINUX-4.2-1.2.0.0-redhat6.9-x86_64/mxm-v3.7/src/mxm/util/debug/debug.c:616
 4 0x0000000000032510 killpg()  ??:0
 5 0x000000000012868a __strcmp_sse42()  thread-freeres.c:0
 6 0x000000000001c6a5 ompi_osc_rdma_query_mtls()  /tmp/marcink/openmpi-3.1.0rc5/ompi/mca/osc/rdma/osc_rdma_component.c:752
 7 0x000000000001c6a5 ompi_osc_rdma_component_query()  /tmp/marcink/openmpi-3.1.0rc5/ompi/mca/osc/rdma/osc_rdma_component.c:372
 8 0x00000000001047d3 ompi_osc_base_select()  ??:0
 9 0x000000000006cbd9 ompi_win_create_dynamic()  /tmp/marcink/openmpi-3.1.0rc5/ompi/win/win.c:341
10 0x00000000000bde08 MPI_Win_create_dynamic()  ??:0
11 0x0000000000400c02 main()  /usit/abel/u1/marcink/mpitest/win_bugreport.c:37
12 0x000000000001ed1d __libc_start_main()  ??:0
13 0x0000000000400d31 _start()  ??:0
===================

@hjelmn (Member) commented May 2, 2018

Weird. I don't know how the mtl check could be crashing except for memory corruption. Can you run with --mca mtl_base_verbose 100?

@angainor commented May 2, 2018

@hjelmn

$ mpirun -np 2 -mca osc_base_verbose 100 -mca osc rdma,sm --mca mtl_base_verbose 100 ./a.out
[1525293605.770834] [compute-2-1:16101:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[1525293605.770834] [compute-2-1:16100:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[compute-2-1.local:16100] mca: base: components_register: registering framework mtl components
[compute-2-1.local:16100] mca: base: components_register: found loaded component mxm
[compute-2-1.local:16101] mca: base: components_register: registering framework mtl components
[compute-2-1.local:16101] mca: base: components_register: found loaded component mxm
[compute-2-1.local:16100] mca: base: components_register: component mxm register function successful
[compute-2-1.local:16101] mca: base: components_register: component mxm register function successful
[compute-2-1.local:16100] mca: base: components_open: opening mtl components
[compute-2-1.local:16100] mca: base: components_open: found loaded component mxm
[compute-2-1.local:16101] mca: base: components_open: opening mtl components
[compute-2-1.local:16101] mca: base: components_open: found loaded component mxm
[1525293605.775879] [compute-2-1:16101:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[1525293605.775867] [compute-2-1:16100:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[compute-2-1.local:16100] mca: base: components_open: component mxm open function successful
[compute-2-1.local:16101] mca: base: components_open: component mxm open function successful
[compute-2-1.local:16101] mca: base: components_register: registering framework osc components
[compute-2-1.local:16101] mca: base: components_register: found loaded component sm
[compute-2-1.local:16101] mca: base: components_register: component sm has no register or open function
[compute-2-1.local:16100] mca: base: components_register: registering framework osc components
[compute-2-1.local:16100] mca: base: components_register: found loaded component sm
[compute-2-1.local:16100] mca: base: components_register: component sm has no register or open function
[compute-2-1.local:16100] mca: base: components_register: found loaded component rdma
[compute-2-1.local:16101] mca: base: components_register: found loaded component rdma
[compute-2-1.local:16101] mca: base: components_register: component rdma register function successful
[compute-2-1.local:16100] mca: base: components_register: component rdma register function successful
[compute-2-1.local:16101] mca: base: components_open: opening osc components
[compute-2-1.local:16101] mca: base: components_open: found loaded component sm
[compute-2-1.local:16100] mca: base: components_open: opening osc components
[compute-2-1.local:16100] mca: base: components_open: found loaded component sm
[compute-2-1.local:16100] mca: base: components_open: component sm open function successful
[compute-2-1.local:16100] mca: base: components_open: found loaded component rdma
[compute-2-1.local:16101] mca: base: components_open: component sm open function successful
[compute-2-1.local:16101] mca: base: components_open: found loaded component rdma
[compute-2-1.local:16101] mca:base:select: Auto-selecting mtl components
[compute-2-1.local:16101] mca:base:select:(  mtl) Querying component [mxm]
[compute-2-1.local:16101] mca:base:select:(  mtl) Query of component [mxm] set priority to 30
[compute-2-1.local:16101] mca:base:select:(  mtl) Selected component [mxm]
[compute-2-1.local:16101] select: initializing mtl component mxm
[compute-2-1.local:16100] mca:base:select: Auto-selecting mtl components
[compute-2-1.local:16100] mca:base:select:(  mtl) Querying component [mxm]
[compute-2-1.local:16100] mca:base:select:(  mtl) Query of component [mxm] set priority to 30
[compute-2-1.local:16100] mca:base:select:(  mtl) Selected component [mxm]
[compute-2-1.local:16100] select: initializing mtl component mxm
[compute-2-1.local:16101] select: init returned success
[compute-2-1.local:16101] select: component mxm selected
[compute-2-1.local:16100] select: init returned success
[compute-2-1.local:16100] select: component mxm selected
[compute-2-1.local:16101] mca: base: close: component mxm closed
[compute-2-1.local:16101] mca: base: close: unloading component mxm
[compute-2-1.local:16100] mca: base: close: component mxm closed
[compute-2-1.local:16100] mca: base: close: unloading component mxm
[compute-2-1.local:16100] selected btl: openib
[compute-2-1:16101:0] Caught signal 11 (Segmentation fault)
==== backtrace ====
 2 0x000000000005a41c mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.1.0-gcc-MLNX_OFED_LINUX-4.2-1.2.0.0-redhat6.9-x86_64/mxm-v3.7/src/mxm/util/debug/debug.c:641
 3 0x000000000005a58c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.1.0-gcc-MLNX_OFED_LINUX-4.2-1.2.0.0-redhat6.9-x86_64/mxm-v3.7/src/mxm/util/debug/debug.c:616
 4 0x0000000000032510 killpg()  ??:0
 5 0x000000000012868a __strcmp_sse42()  thread-freeres.c:0
 6 0x000000000001c6a5 ompi_osc_rdma_query_mtls()  /tmp/marcink/openmpi-3.1.0rc5/ompi/mca/osc/rdma/osc_rdma_component.c:752
 7 0x000000000001c6a5 ompi_osc_rdma_component_query()  /tmp/marcink/openmpi-3.1.0rc5/ompi/mca/osc/rdma/osc_rdma_component.c:372
 8 0x00000000001047d3 ompi_osc_base_select()  ??:0
 9 0x000000000006cbd9 ompi_win_create_dynamic()  /tmp/marcink/openmpi-3.1.0rc5/ompi/win/win.c:341
10 0x00000000000bde08 MPI_Win_create_dynamic()  ??:0
11 0x0000000000400c02 main()  /usit/abel/u1/marcink/mpitest/win_bugreport.c:37
12 0x000000000001ed1d __libc_start_main()  ??:0
13 0x0000000000400d31 _start()  ??:0
===================

@hjelmn (Member) commented May 2, 2018

Huh, interesting. I wonder if mxm is causing the issue. Try --mca mtl ^mxm

@hjelmn (Member) commented May 2, 2018

This is the sequence of events, from what I can tell. It is unrelated to this PR. At some point Intel added a check to osc/rdma to disable its use with OPA, because verbs over OPA is not recommended. The problem is that we end up with this sequence (a sketch follows the list):

  1. mtl/mxm is loaded which causes ompi_mtl_base_selected_component to be set
  2. pml/ucx or pml/ob1 (if you use --mca pml ob1) causes pml/cm to get unloaded
  3. mtl/mxm is dlclosed
  4. We try to access ompi_mtl_base_selected_component in osc/rdma
  5. CRASH

*&^$#)($
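
A minimal standalone sketch of that failure mode; plugin.so, the component_name symbol, and the scaffolding are hypothetical, and only the dangling-pointer pattern mirrors the report:

#include <dlfcn.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for ompi_mtl_base_selected_component: a global that ends up
 * pointing at data owned by a dynamically loaded component. */
static const char *selected_component_name = NULL;

int main(void)
{
    /* Step 1: the component DSO is loaded and a pointer to a string
     * living inside it is stashed in a global. */
    void *handle = dlopen("./plugin.so", RTLD_NOW);   /* hypothetical DSO */
    if (NULL == handle) {
        return 1;
    }
    selected_component_name = (const char *) dlsym(handle, "component_name");

    /* Steps 2-3: selection logic elsewhere decides the component is not
     * needed and dlclose()s it, which can unmap its text and data
     * segments; the global still points into those pages. */
    dlclose(handle);

    /* Steps 4-5: a later consumer (osc/rdma in the report) dereferences
     * the stale pointer.  The strcmp() then walks unmapped memory, which
     * matches the __strcmp_sse42() frame in the backtraces above. */
    if (0 == strcmp(selected_component_name, "mxm")) {   /* may SIGSEGV */
        puts("mxm selected");
    }
    return 0;
}

(Link with -ldl on older glibc.)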

@angainor commented May 2, 2018

@hjelmn Running in a while true loop, the crash doesn't happen with -mca mtl ^mxm.

So yes, not related to this PR.

@hjelmn (Member) commented May 2, 2018

Ok, so the problem @angainor reported is unrelated to this PR. A PR to fix that problem is incoming.

@hjelmn (Member) commented May 2, 2018

See #5136

@angainor commented May 2, 2018

@bwbarrett @xinzhao3 This PR together with #5135 fixes the osc ucx segfault ("no atomic operations on registered memory transport") on our ConnectX-3.

@angainor commented May 2, 2018

@xinzhao3 Of course, the segfault is still there if I force -mca osc ucx on this architecture.

@jladd-mlnx (Member)

Great news!

@hjelmn (Member) commented May 2, 2018

@jladd-mlnx Yeah, this puts v3.1.0 on track for release and gives time to identify the remaining SEGV for v3.1.1.

@jladd-mlnx jladd-mlnx merged commit 32ddc6a into open-mpi:master May 2, 2018
@jladd-mlnx (Member)

@xinzhao3 please cherry-pick into the v3.1.x branch.
