Segmentation fault when opening a file #3695

Closed
wkliao opened this issue Jun 12, 2017 · 6 comments
wkliao commented Jun 12, 2017

I am encountering a segmentation fault in MPI_File_open when using a 3D communicator created by MPI_Cart_create, with OpenMPI version 2.1.0 and Intel C compiler 17.0.0 on an Ubuntu Linux machine.

The gdb trace points to a possible cause at line 914 of file io_ompio_file_open.c

    int coords_tmp[2] = { 0 };

coords_tmp holds only two elements, which is too small for a 3D cartesian communicator, where ompio_fh->f_comm->c_topo->mtc.cart->ndims is 3.

Below are the gdb trace and the test program.

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f2e18f87b4c in mca_io_ompio_cart_based_grouping (ompio_fh=0x0)
    at io_ompio_file_open.c:968
968	           if ((coords_tmp[1]/ompio_fh->f_init_procs_per_group) ==
(gdb) where
#0  0x00007f2e18f87b4c in mca_io_ompio_cart_based_grouping (ompio_fh=0x0)
    at io_ompio_file_open.c:968
#1  0x00007f2e18f85da8 in ompio_io_ompio_file_open (comm=0x2051850, 
    filename=0x20584b0 "testfile", amode=9, 
    info=0x601540 <ompi_mpi_info_null>, ompio_fh=0x20588d0, 
    use_sharedfp=1 '\001') at io_ompio_file_open.c:204
#2  0x00007f2e18f8585b in mca_io_ompio_file_open (comm=0x2051850, 
    filename=0x20584b0 "testfile", amode=9, 
    info=0x601540 <ompi_mpi_info_null>, fh=0x20584d0)
    at io_ompio_file_open.c:62
#3  0x00007f2e26a9fd88 in mca_io_base_file_select (file=0x20584d0, 
    preferred=0x0) at base/io_base_file_select.c:457
#4  0x00007f2e2696a40e in ompi_file_open (comm=0x2051850, 
    filename=0x400f54 "testfile", amode=9, info=0x601540 <ompi_mpi_info_null>, 
    fh=0x7ffcffb0abc0) at file/file.c:132
#5  0x00007f2e26a54ffe in PMPI_File_open (comm=0x2051850, 
    filename=0x400f54 "testfile", amode=9, info=0x601540 <ompi_mpi_info_null>, 
    fh=0x7ffcffb0abc0) at pfile_open.c:92
#6  0x0000000000400a58 in main (argc=1, argv=0x7ffcffb0acd8) at cart_bug.c:18

Test program (cart_bug.c):

#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int nprocs, cart_nprocs, dims[3]={1,1,0}, periods[3]={0,0,0};
    MPI_Comm comm_cart;
    MPI_File fh;

    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Dims_create(nprocs, 3, dims);

    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 0, &comm_cart);

    MPI_File_open(comm_cart, "testfile", MPI_MODE_CREATE | MPI_MODE_RDWR,
                  MPI_INFO_NULL, &fh);

    MPI_Finalize();
    return 0;
}
@edgargabriel
Member

edgargabriel commented Jun 12, 2017 via email

@ggouaillardet
Contributor

According to gdb, the issue is that ompio_fh is NULL.

@edgargabriel
Member

Without having looked into this, I suspect that Wei-keng is correct in his analysis. All of our test cases for the cart-grouping are based on 2-D cartesian topologies. I need to add a 3-D test case (or prevent ompio from entering this code section if the current code cannot easily be extended to 3-D and higher-dimensional topologies).

@wkliao
Contributor Author

wkliao commented Jun 13, 2017

My digging shows that line 966 calls mca_topo_base_cart_coords() with cart_topo.ndims equal to 3, while coords_tmp[] has only two elements. The function therefore assigns a value to coords_tmp[2], which is out of bounds and may cause ompio_fh to become NULL.

966     ompio_fh->f_comm->c_topo->topo.cart.cart_coords (ompio_fh->f_comm, j, cart_topo.ndims, coords_tmp);

In file ompi/mca/topo/base/topo_base_cart_coords.c, line 56 is where coords_tmp[2] gets written.

 51     for (i = 0;
 52         (i < comm->c_topo->mtc.cart->ndims) && (i < maxdims);
 53         ++i, ++d) {
 54         dim = *d;
 55         remprocs /= dim;
 56         *coords++ = rank / remprocs;
 57         rank %= remprocs;
 58     }
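
To make the overflow concrete, here is a minimal standalone sketch of the same rank-to-coordinates loop, written without any Open MPI types. The function name cart_coords_sketch and its signature are hypothetical, chosen only for illustration. The loop writes min(ndims, maxdims) entries, so the caller's buffer must hold at least that many ints. Passing maxdims = 3 with a two-element buffer is exactly the out-of-bounds write described above.

```c
#include <assert.h>

/* Hypothetical standalone helper mirroring the loop at lines 51-58.
 * dims[] holds the grid extents; coords[] receives the result.
 * Writes at most min(ndims, maxdims) entries and returns that count,
 * so coords[] must have room for at least that many ints. */
int cart_coords_sketch(int rank, int ndims, const int *dims,
                       int maxdims, int *coords)
{
    int remprocs = 1;
    int i;

    assert(ndims > 0 && maxdims >= 0);

    /* Total number of processes in the grid. */
    for (i = 0; i < ndims; ++i) {
        assert(dims[i] > 0);
        remprocs *= dims[i];
    }

    /* Same bound as the original loop: stop at ndims or maxdims. */
    for (i = 0; (i < ndims) && (i < maxdims); ++i) {
        remprocs /= dims[i];
        coords[i] = rank / remprocs;
        rank %= remprocs;
    }
    return i;
}
```

With a buffer declared as int coords_tmp[2], a caller would have to pass maxdims = 2 to stay in bounds. To get all three coordinates, the buffer itself has to grow to ndims elements.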

@edgargabriel
Member

edgargabriel commented Jun 15, 2017

I have a fix pending for this issue, and I will file PRs for 2.1.x and 3.0.x for the first part of it.

The longer story: the cartesian grouping based algorithm was unfortunately left out of the rewrite of the aggregator selection algorithms two years (or so) back. It is called at the wrong place (file_open instead of file_set_view), and it cannot deal with any cart topology other than 2-D.

The fix consists of two parts:

  1. Remove the function call to cart_based_grouping from file_open. This way, we are not doing any damage.
  2. I updated the cart_based_grouping code to i) have the same interfaces as the other algorithms, ii) deal with arbitrary cartesian topologies, and iii) be integrated correctly in file_set_view.

Right now, for 3.0.x, my goal is just to bring the first fix over to avoid the segfault. I would like to have part 2 sit on master for a while to make sure that we are not breaking something new, although I did fairly exhaustive testing on it.
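
For readers wondering what "deal with arbitrary cartesian topologies" can look like in practice, one option is to size the coordinate storage from ndims at run time instead of hard-coding two elements. The sketch below is a standalone illustration under that assumption and is not the actual ompio change. The helper name build_coords_table is hypothetical, and it assumes the row-major rank ordering that MPI uses for cartesian topologies.

```c
#include <stdlib.h>

/* Hypothetical helper: returns a heap-allocated nprocs x ndims table of
 * row-major cartesian coordinates (entry [rank * ndims + i] is coordinate i
 * of that rank), or NULL on invalid input or allocation failure.
 * The caller owns the table and must free() it. */
int *build_coords_table(int ndims, const int *dims, int *nprocs_out)
{
    int nprocs = 1;

    if (ndims <= 0) {
        return NULL;
    }
    for (int i = 0; i < ndims; ++i) {
        if (dims[i] <= 0) {
            return NULL;
        }
        nprocs *= dims[i];
    }

    /* Storage scales with ndims, so 2-D, 3-D, and higher all fit. */
    int *table = malloc((size_t)nprocs * (size_t)ndims * sizeof *table);
    if (table == NULL) {
        return NULL;
    }

    for (int rank = 0; rank < nprocs; ++rank) {
        int rem = rank;
        /* Peel off the fastest-varying (last) dimension first. */
        for (int i = ndims - 1; i >= 0; --i) {
            table[rank * ndims + i] = rem % dims[i];
            rem /= dims[i];
        }
    }

    *nprocs_out = nprocs;
    return table;
}
```

Because the table is allocated per topology, the same code path serves any number of dimensions without a fixed-size stack array.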

@edgargabriel edgargabriel self-assigned this Jun 15, 2017
edgargabriel added a commit to edgargabriel/ompi that referenced this issue Jun 16, 2017
the cart_based_grouping aggregator strategy was not correctly updated
during the last major rewrite of the aggregator selection algorithm.
It is also not supposed to be called from file_open (but from
file_set_view).

This fixes an issue reported on the mailing list by @wkliao issue open-mpi#3695

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
@edgargabriel
Member

This issue has been fixed.

pkestene added a commit to pkestene/euler_kokkos that referenced this issue Feb 3, 2019
…(both hdf5 and pnetcdf), there is a bug in OpenMPI version 2.1 (used in Ubuntu 17.10 at least) that is reported here open-mpi/ompi#3695; let's try another version of openmpi