I am using self-compiled Open MPI from the 5.0.x development branch on a standard desktop system with openSUSE Tumbleweed. I have up-to-date Git submodules and have executed autogen.pl before compilation.
In the commit 43b5d8c, the code of ompi_osc_rdma_component_query has been changed to always return OMPI_ERR_RMA_SHARED when shared memory functionality is queried. Before, the function was returning -1. This change, however, leads to unnecessary failures of the component selection in ompi_osc_base_select. The latter function fails when any of the available one-sided communication components produces OMPI_ERR_RMA_SHARED, even though other components would work perfectly fine.
To give an example, I tested compilation and execution of the following program:
#include <mpi.h>
#include <stdio.h>

int main (int argc, char *argv[])
{
    MPI_Win win;
    int *ptr, nproc, rank, size = sizeof(int), disp = 1;

    // The processes allocate a contiguous shared memory segment.
    // Each process controls a chunk of the byte size of one integer.
    // Each process writes its rank into the shared memory.
    // The rank-0 process then prints contents of the whole shared memory (= all rank IDs).
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Win_allocate_shared(size, disp, MPI_INFO_NULL, MPI_COMM_WORLD, &ptr, &win);
    *ptr = rank;
    MPI_Win_fence(0, win);
    if (rank == 0)
    {
        for (int i = 0; i < nproc; i++)
        {
            printf("%d ", ptr[i]);
        }
        printf("\n");
    }
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
When I compile the program with the current 5.0.x version and attempt to run it, I get
[yunipher:00000] *** An error occurred in MPI_Win_allocate_shared
[yunipher:00000] *** reported by process [3017211905,0]
[yunipher:00000] *** on communicator MPI_COMM_WORLD
[yunipher:00000] *** MPI_ERR_RMA_SHARED: Memory cannot be shared
[yunipher:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[yunipher:00000] *** and MPI will try to terminate your MPI job as well)
This can be avoided either by using a pre-43b5d8c version of the code or by manually excluding the broken "rdma" osc component (the "sm" component is then the only one considered):
$ mpiexec -n 1 --mca osc ^rdma ./test.x
I believe that the code in ompi_osc_base_select is overreacting. It should not pass through the error status OMPI_ERR_RMA_SHARED from a single component unless all available components are unusable for shared memory.
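To make the proposal concrete, here is a rough sketch of the selection behavior I have in mind. It is not the actual Open MPI code: the `sketch_*` and `mock_*` names, the status values, and the "query returns a priority or a negative status" convention are all made up for illustration. The only point is that a "memory cannot be shared" status from one component is remembered and reported only when no other component is usable.

```c
#include <stddef.h>

/* Hypothetical status codes for this sketch only; the real values are
 * defined in Open MPI's own headers. */
enum {
    SKETCH_SUCCESS        =  0,
    SKETCH_ERR_NOT_FOUND  = -1,
    SKETCH_ERR_RMA_SHARED = -2
};

/* A mock component query: returns a priority (>= 0) if the component can
 * serve the request, or a negative status code if it cannot. */
typedef int (*sketch_query_fn)(void);

/* Pick the usable component with the highest priority and store its index
 * in *best. A component reporting SKETCH_ERR_RMA_SHARED is skipped rather
 * than aborting the whole selection; that status is returned only if no
 * component at all turned out to be usable. */
int sketch_select(const sketch_query_fn *queries, size_t n, size_t *best)
{
    int best_priority = -1;
    int saw_rma_shared = 0;

    for (size_t i = 0; i < n; i++) {
        int priority = queries[i]();

        if (priority == SKETCH_ERR_RMA_SHARED) {
            saw_rma_shared = 1;   /* remember the reason, keep looking */
            continue;
        }
        if (priority < 0) {
            continue;             /* component not usable for another reason */
        }
        if (priority > best_priority) {
            best_priority = priority;
            *best = i;
        }
    }

    if (best_priority >= 0) {
        return SKETCH_SUCCESS;
    }
    return saw_rma_shared ? SKETCH_ERR_RMA_SHARED : SKETCH_ERR_NOT_FOUND;
}

/* Mock stand-ins for the two components discussed above (priority value
 * chosen arbitrarily). */
int mock_rdma_query(void) { return SKETCH_ERR_RMA_SHARED; }
int mock_sm_query(void)   { return 100; }
```

With both mock components registered, the sketch selects the "sm" stand-in; with only the "rdma" stand-in registered, it reports the shared-memory error, which matches the behavior I would expect from the real selection function.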