OSC/UCX crashes with c_flush from ibm test suite #5117

Closed
hppritcha opened this issue Apr 30, 2018 · 8 comments

@hppritcha
Member

The OSC/UCX component on master fails for the c_flush.c test in the ibm test suite. With verbosity set to 100, one sees this error message from UCX:

[1525114642.639621] [primavera:3170 :0] ucp_mm.c:264 UCX ERROR Undefined address requires UCP_MEM_MAP_ALLOCATE flag
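
For context, that error comes from ucp_mem_map(): UCX accepts a NULL address only when the caller also sets UCP_MEM_MAP_ALLOCATE and lets UCX allocate the memory itself. Here is a minimal sketch of the two legal usages (it assumes an already-initialized ucp_context_h; error handling is elided):

/* Sketch: the two legal ways to map memory with ucp_mem_map().
 * Assumes an initialized ucp_context_h "context"; errors elided. */
#include <ucp/api/ucp.h>

void map_examples(ucp_context_h context, void *buf, size_t len)
{
    ucp_mem_map_params_t params;
    ucp_mem_h memh_user, memh_alloc;

    /* Case 1: register an existing, non-NULL user buffer. */
    params.field_mask = UCP_MEM_MAP_PARAM_FIELD_ADDRESS |
                        UCP_MEM_MAP_PARAM_FIELD_LENGTH;
    params.address    = buf;    /* NULL here triggers the error above */
    params.length     = len;
    ucp_mem_map(context, &params, &memh_user);

    /* Case 2: a NULL address is legal only together with
     * UCP_MEM_MAP_ALLOCATE, which asks UCX to allocate the memory. */
    params.field_mask = UCP_MEM_MAP_PARAM_FIELD_ADDRESS |
                        UCP_MEM_MAP_PARAM_FIELD_LENGTH  |
                        UCP_MEM_MAP_PARAM_FIELD_FLAGS;
    params.address    = NULL;
    params.length     = len;
    params.flags      = UCP_MEM_MAP_ALLOCATE;
    ucp_mem_map(context, &params, &memh_alloc);
}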

The problem appears to be that the UCX component can't handle NULL buffers being supplied to MPI_Win_create. Here's the test:

/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil ; -*- */
/*
 *
 *  (C) 2003 by Argonne National Laboratory.
 *      See COPYRIGHT in top-level directory.
 */

#include <stdlib.h>
#include <stdio.h>

#include <mpi.h>

#include "ompitest_error.h"

#define ITER 100

int main( int argc, char *argv[] )
{
    int rank, nproc, i;
    int errors = 0, all_errors = 0;
    int *buf;
    MPI_Win window;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    ompitest_check_size(__FILE__, __LINE__, 2, 1);

    /** Create using MPI_Win_create() **/

    if (rank == 0) {
      MPI_Alloc_mem(sizeof(int), MPI_INFO_NULL, &buf);
      *buf = nproc-1;
    } else
      buf = NULL;

    MPI_Win_create(buf, sizeof(int)*(rank == 0), 1, MPI_INFO_NULL, MPI_COMM_WORLD, &window);

    /* Test flush of an empty epoch */
    MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, window);
    MPI_Win_flush_all(window);
    MPI_Win_unlock(0, window);

    MPI_Barrier(MPI_COMM_WORLD);

    /* Test third-party communication, through rank 0. */
    MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, window);

    for (i = 0; i < ITER; i++) {
        int val = -1, exp = -1;

        /* Processes form a ring.  Process 0 starts first, then passes a token
         * to the right.  Each process, in turn, performs third-party
         * communication via process 0's window. */
        if (rank > 0) {
            MPI_Recv(NULL, 0, MPI_BYTE, rank-1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        MPI_Get_accumulate(&rank, 1, MPI_INT, &val, 1, MPI_INT, 0, 0, 1, MPI_INT, MPI_REPLACE, window);
        MPI_Win_flush(0, window);

        exp = (rank + nproc-1) % nproc;

        if (val != exp) {
            printf("%d - Got %d, expected %d\n", rank, val, exp);
            errors++;
        }

        if (rank < nproc-1) {
            MPI_Send(NULL, 0, MPI_BYTE, rank+1, 0, MPI_COMM_WORLD);
        }

        MPI_Barrier(MPI_COMM_WORLD);
    }

    MPI_Win_unlock(0, window);

    MPI_Win_free(&window);
    if (buf) MPI_Free_mem(buf);

    MPI_Reduce(&errors, &all_errors, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0 && all_errors == 0)
        printf(" No Errors\n");

    MPI_Finalize();

    return 0;
}

I'm using UCX master at 1785c376beeff9
The problem vanishes if all ranks do an MPI_Alloc_mem and supply the returned pointer to MPI_Win_create.
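
For illustration, here's a minimal sketch of that workaround (standard MPI only; error checking and the epoch/communication body of c_flush.c are elided):

/* Workaround sketch: every rank passes an MPI_Alloc_mem'ed buffer to
 * MPI_Win_create; only rank 0 exposes a nonzero window size, but no
 * rank passes NULL. Compile with mpicc. */
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, nproc;
    int *buf;
    MPI_Win window;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    MPI_Alloc_mem(sizeof(int), MPI_INFO_NULL, &buf);
    if (rank == 0)
        *buf = nproc - 1;

    MPI_Win_create(buf, sizeof(int) * (rank == 0), 1, MPI_INFO_NULL,
                   MPI_COMM_WORLD, &window);

    /* ... same lock/flush/get_accumulate traffic as in c_flush.c ... */

    MPI_Win_free(&window);
    MPI_Free_mem(buf);
    MPI_Finalize();
    return 0;
}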

@hppritcha
Member Author

@jladd-mlnx

@hppritcha
Member Author

Oh, I should add that this is on a box where I only have xpmem installed; no Mellanox/IB hardware.

@xinzhao3
Contributor

xinzhao3 commented May 1, 2018

@hppritcha I think this NULL-buffer bug is fixed by #5094. Could you try that patch with this test?

@jladd-mlnx jladd-mlnx added this to the v3.1.1 milestone May 1, 2018
@jladd-mlnx jladd-mlnx added the bug label May 1, 2018
@xinzhao3
Contributor

xinzhao3 commented May 2, 2018

@hppritcha I ran your code with patch #5094 and the issue looks fixed. Could you try running it again? Thanks!

@hppritcha
Member Author

Well, I tried with updated master (with #5094 merged), and now I see a new error:

Running test
Running test: mpirun -np 4 ./c_flush

hpp@primavera:~/ompi-tests/ibm/onesided> mpirun -np 4 ./c_flush
[primavera:20850:0:20850] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace ====
===================
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node primavera exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

and if I look at the core dump I see:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000ffffaef3e8f8 in ucp_get_nbi (ep=0x5be2ef0, buffer=0xfffffd9ce154, length=4, remote_addr=730760800, rkey=0x0)
    at rma/basic_rma.c:389
389	    status = UCP_RKEY_RESOLVE(rkey, ep, rma);
[Current thread is 1 (Thread 0xffffb18b6000 (LWP 20822))]
Missing separate debuginfos, use: zypper install glibc-debuginfo-2.22-3.3.aarch64 libgcc_s1-debuginfo-5.3.1+r233831-7.1.aarch64 libnuma1-debuginfo-2.0.9-10.2.aarch64 libz1-debuginfo-1.2.8-8.2.aarch64
(gdb) bt
#0  0x0000ffffaef3e8f8 in ucp_get_nbi (ep=0x5be2ef0, buffer=0xfffffd9ce154, length=4, remote_addr=730760800, rkey=0x0)
    at rma/basic_rma.c:389
#1  0x0000ffffaec816bc in ompi_osc_ucx_get () from /home/hpp/ompi/install/lib/openmpi/mca_osc_ucx.so
#2  0x0000ffffaec82d44 in ompi_osc_ucx_get_accumulate () from /home/hpp/ompi/install/lib/openmpi/mca_osc_ucx.so
#3  0x0000ffffb180b534 in PMPI_Get_accumulate () from /home/hpp/ompi/install/lib/libmpi.so.0
#4  0x000000000040190c in main (argc=1, argv=0xfffffd9ce2c8) at c_flush.c:60
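
Looking at frame #0, ucp_get_nbi() is called with rkey=0x0, so UCP_RKEY_RESOLVE dereferences a NULL remote key. A hypothetical guard in the calling layer would look something like the sketch below (osc_get_guarded is an illustrative name, not the actual OSC/UCX code):

#include <stdint.h>
#include <ucp/api/ucp.h>

/* Hypothetical guard (not the real OSC/UCX implementation): skip the
 * UCP RMA call when there is nothing to transfer or no remote key was
 * ever unpacked for this peer, so ucp_get_nbi never sees rkey == NULL. */
static ucs_status_t osc_get_guarded(ucp_ep_h ep, void *buffer, size_t length,
                                    uint64_t remote_addr, ucp_rkey_h rkey)
{
    if (length == 0 || rkey == NULL) {
        return UCS_OK;  /* nothing to fetch / zero-sized target window */
    }
    return ucp_get_nbi(ep, buffer, length, remote_addr, rkey);
}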

@hppritcha
Member Author

This is with UCX 1.3.0. I still see many failures in the ibm/onesided tests; do you actually test these? At least 50% of the tests fail.

I suspect that no one is testing UCX with xpmem support only, i.e. no InfiniBand. Most of the log files show output like:

Running test
Running test: mpirun -np 4 ./c_strided_putget_indexed
[primavera:21787:0:21787] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace ====
===================
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node primavera exited on signal 7 (Bus error).
--------------------------------------------------------------------------
Exit status: 135

@jladd-mlnx
Member

openucx/ucx#2588

@hppritcha
Member Author

This appears to be fixed. Closing.
