Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential segfault in MPIR_Attr_delete_list #6364

Closed
giordano opened this issue Jan 17, 2023 · 7 comments
Closed

Potential segfault in MPIR_Attr_delete_list #6364

giordano opened this issue Jan 17, 2023 · 7 comments
Labels
need confirm Need verify if the issue still exist with latest code

Comments

@giordano
Copy link
Contributor

giordano commented Jan 17, 2023

In JuliaParallel/MPI.jl#684 we experienced some segmentation faults using MVAPICH 2 on Ookami. This was reported, with a reproducer, in the MVAPICH mailing list and a patch

diff --git a/src/mpi/attr/attrutil.c b/src/mpi/attr/attrutil.c
index b9abbe6..5af452a 100644
--- a/src/mpi/attr/attrutil.c
+++ b/src/mpi/attr/attrutil.c
@@ -266,6 +266,7 @@ int MPIR_Attr_delete_list( int handle, MPID_Attribute **attr )
           corresponding keyval */
        /* Still to do: capture any error returns but continue to
           process attributes */
+    if (p->keyval) {
        mpi_errno = MPIR_Call_attr_delete( handle, p );
​
        /* We must also remove the keyval reference.  If the keyval
@@ -282,6 +283,7 @@ int MPIR_Attr_delete_list( int handle, MPID_Attribute **attr )
                MPIU_Handle_obj_free( &MPID_Keyval_mem, p->keyval );
            }
        }
+    }
​
        MPIU_Handle_obj_free( &MPID_Attr_mem, p );

is available to fix the segmentation fault, which I can confirm is working for me.

I don't have an MPICH build so I couldn't check whether I could reproduce the same segmentation fault with MPICH, but the code of MPIR_Attr_delete_list

/* Routine to delete an attribute list */
int MPIR_Attr_delete_list(int handle, MPIR_Attribute ** attr)
{
MPIR_Attribute *p, *new_p;
int mpi_errno = MPI_SUCCESS;
p = *attr;
while (p) {
/* delete the attribute by first executing the delete routine, if any,
* determine the the next attribute, and recover the attributes
* storage */
new_p = p->next;
/* Check the sentinels first */
/* --BEGIN ERROR HANDLING-- */
if (p->pre_sentinal != 0 || p->post_sentinal != 0) {
MPIR_ERR_SET(mpi_errno, MPI_ERR_OTHER, "**attrsentinal");
/* We could keep trying to free the attributes, but for now
* we'll just bag it */
return mpi_errno;
}
/* --END ERROR HANDLING-- */
/* For this attribute, find the delete function for the
* corresponding keyval */
/* Still to do: capture any error returns but continue to
* process attributes */
mpi_errno = MPIR_Call_attr_delete(handle, p);
/* We must also remove the keyval reference. If the keyval
* was freed earlier (reducing the refcount), the actual
* release and free will happen here. We must free the keyval
* even if the attr delete failed, as we then remove the
* attribute.
*/
{
int in_use;
/* Decrement the use of the keyval */
MPII_Keyval_release_ref(p->keyval, &in_use);
if (!in_use) {
MPIR_Handle_obj_free(&MPII_Keyval_mem, p->keyval);
}
}
MPIR_Handle_obj_free(&MPID_Attr_mem, p);
p = new_p;
}
/* We must zero out the attribute list pointer or we could attempt to use it
* later. This normally can't happen because the communicator usually
* disappears after a call to MPI_Comm_free. But if the attribute keyval
* has an associated delete function that returns an error then we don't
* actually free the communicator despite having freed all the attributes
* associated with the communicator.
*
* This function is also used for Win and Type objects, but the idea is the
* same in those cases as well. */
*attr = NULL;
return mpi_errno;
}
looks more or less the same as the one in MVAPICH, in particular the unguarded use of p->keyval, so you may want to apply a similar patch.

CC: @simonbyrne.

@raffenet
Copy link
Contributor

The reproducer:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    MPI_Datatype dup_type;
    MPI_Type_dup(MPI_UINT32_T, &dup_type);

    MPI_Type_commit(&dup_type);

    int keyval;
    MPI_Type_create_keyval(MPI_TYPE_NULL_COPY_FN,
                           MPI_TYPE_NULL_DELETE_FN,
                           &keyval, NULL);

    MPI_Type_set_attr(dup_type, keyval, NULL);

    MPI_Finalize();
    return 0;
}

FWIW, this does not segfault for me with MPICH 4.1rc2. I also configured with AddressSanitizer enabled and did not receive any warnings.

@giordano
Copy link
Contributor Author

For what is worth, I couldn't reproduce with MVAPICH on x86_64 either, the only system where I bumped into the segfault so far is Ookami.

@raffenet
Copy link
Contributor

Got it. I'll give this a try on one of our A64FX nodes and see how it goes. Any special compiler or configuration options?

@raffenet raffenet added the need confirm Need verify if the issue still exist with latest code label Jan 17, 2023
@giordano
Copy link
Contributor Author

giordano commented Jan 17, 2023

Nothing too special, this is the configuration used for MVAPICH:

MVAPICH2 2.3.7 Wed March 02 22:00:00 EST 2022 ch3:mrail

Compilation
CC: gcc    -DNDEBUG -DNVALGRIND -O2
CXX: g++   -DNDEBUG -DNVALGRIND -O2
F77: gfortran -w -fallow-argument-mismatch -O2  -O2
FC: gfortran   -O2

Configuration
--prefix=/lustre/software/mvapich2/gcc12/2.3.7 --with-knem=/opt/knem-1.1.4.90mlnx1 --with-hcoll=/opt/mellanox/hcoll --enable-fortran=all --enable-cxx --with-file-system=lustre --with-slurm=/cm/shared/apps/slurm/current --with-pm=slurm --with-device=ch3:mrail --with-pmi=pmi1 -with-rdma=gen2

Compiled with GCC 12.1

@raffenet
Copy link
Contributor

After looking at the code some, I have a hard time seeing how keyval could be NULL. That said, I suppose this kind of defensive programming couldn't hurt. What do you think @hzhou?

@hzhou
Copy link
Contributor

hzhou commented Jan 18, 2023

I think MVAPICH's patch is due to glitches (or bugs) somewhere else that somehow left invalid entries in the attribute list. We should try to locate the source of the issue (if we are confirmed to have the same issue) and fix the source. The unnecessary defensive code, while does not hurt for the execution, adds confusion for the code maintenance.

@hzhou
Copy link
Contributor

hzhou commented Mar 10, 2023

Close this issue due to non-reproducible. Re-open if it can be reproduced with recent MPICH.

@hzhou hzhou closed this as completed Mar 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
need confirm Need verify if the issue still exist with latest code
Projects
None yet
Development

No branches or pull requests

3 participants