Dear Open MPI team,
I stumbled upon a problem in Bcast algorithm 8 in Open MPI 4.0.2.
see: https://github.com/open-mpi/ompi/blob/master/ompi/mca/coll/base/coll_base_bcast.c
At line 905, the following code appears:
err = MCA_PML_CALL(recv((char *) buf + (ptrdiff_t)offset * extent,
                        count - offset, datatype, remote,
                        MCA_COLL_BASE_TAG_BCAST,
                        comm, &status));
If I start a Bcast of 10 MPI_INTs on 7 nodes with 1 process per node (srun -N 7 --ntasks-per-node=1), whether the code crashes depends on the underlying transport component.
On vader and openib, the code (interestingly) runs to completion.
On psm2, it crashes like this:
Message size 18446744073709551608 bigger than supported by PSM2 API. Max = 4294967296
And indeed, when tracing count - offset at that point, the recv is posted with count - offset = -2 for rank 5; -2 MPI_INTs is -8 bytes, which is exactly 18446744073709551608 when interpreted as an unsigned 64-bit size. I was surprised that this actually worked on vader and openib.
count - offset should not become negative. We'll probably just have to skip this recv call in the algorithm's code whenever the count argument becomes non-positive (a rough sketch follows below), but I'll leave the actual fix to the algorithm's designer.
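To illustrate what I mean, here is a minimal, untested sketch of such a guard; the variables (buf, offset, extent, count, datatype, remote, comm, status, err) are the ones already used in coll_base_bcast.c, and the exact condition and error handling are of course up to the maintainers:
/* Only post the recv when there is actually data left to receive;
   otherwise skip it so count - offset can never go negative. */
if (count - offset > 0) {
    err = MCA_PML_CALL(recv((char *) buf + (ptrdiff_t)offset * extent,
                            count - offset, datatype, remote,
                            MCA_COLL_BASE_TAG_BCAST,
                            comm, &status));
}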
I am also attaching a short MPI_Bcast test program (since I was surprised to see this work on openib, I checked a few cases and outcomes). If started with
srun -N 7 --ntasks-per-node=1 ./bcast_test 10
the program below crashes on psm2 (as expected).
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

/* Reference broadcast: root sends the buffer to every other rank. */
int my_MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
{
    int i;
    int rank, size;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == root) {
        for (i = 1; i < size; i++) {
            MPI_Send(buffer, count, datatype, i, 0, comm);
        }
    } else {
        MPI_Recv(buffer, count, datatype, root, 0, comm, MPI_STATUS_IGNORE);
    }
    return 0;
}

int main(int argc, char *argv[])
{
    int rank, size;
    int n, i, j;
    int *buf;
    int root = 0, correct, allcorrect;
    int *res;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (argc < 2) {
        exit(1);
    }
    n = atoi(argv[1]);

    buf = (int *) calloc(n, sizeof(int));
    res = (int *) calloc(n, sizeof(int));

    if (rank == root) {
        for (i = 0; i < n; i++) {
            buf[i] = i;
        }
    }

    /* Broadcast once with the reference implementation and keep the result. */
    my_MPI_Bcast(buf, n, MPI_INT, root, MPI_COMM_WORLD);
    memcpy(res, buf, n * sizeof(int));

    /* Reset the receive buffers and broadcast again with MPI_Bcast. */
    if (rank != root) {
        for (i = 0; i < n; i++) buf[i] = 0;
    }
    MPI_Bcast(buf, n, MPI_INT, root, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);

    /* Print what each non-root rank received. */
    for (j = 1; j < size; j++) {
        if (rank == j) {
            printf("%d: ", rank);
            for (i = 0; i < n; i++) {
                printf("%d ", buf[i]);
            }
            printf("\n");
        }
        fflush(stdout);
        MPI_Barrier(MPI_COMM_WORLD);
    }

    /* Compare the MPI_Bcast result against the reference broadcast. */
    correct = 1;
    for (i = 0; i < n; i++) {
        if (res[i] != buf[i]) {
            correct = 0;
            printf("%d: elem %d\n", rank, i);
        }
        fflush(stdout);
    }
    MPI_Allreduce(&correct, &allcorrect, 1, MPI_INT, MPI_MIN, MPI_COMM_WORLD);
    if (rank == root) {
        if (allcorrect == 1) {
            printf("correct\n");
        } else {
            printf("incorrect\n");
        }
    }

    MPI_Finalize();
    free(buf);
    free(res);
    return 0;
}
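For completeness, this is how I build and run the reproducer (assuming the usual mpicc compiler wrapper; the srun line is the same as above):
mpicc -o bcast_test bcast_test.c
srun -N 7 --ntasks-per-node=1 ./bcast_test 10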