
Algorithmic bug in Bcast algorithm 8 #7410

@hunsa

Description

Dear Open MPI team,

I stumbled upon a problem in Bcast algorithm 8 in Open MPI 4.0.2; see https://github.com/open-mpi/ompi/blob/master/ompi/mca/coll/base/coll_base_bcast.c

At line 905, the following recv call is posted:

                    err = MCA_PML_CALL(recv((char *)buf + (ptrdiff_t)offset * extent,
                                            count - offset, datatype, remote,
                                            MCA_COLL_BASE_TAG_BCAST,
                                            comm, &status));

If I start a Bcast with 10 INTs on 7 nodes with 1 process per node (srun -N 7 --ntasks-per-node=1), the code crashes or not depending on the underlying transport component.

On vader and openib, the code (interestingly) runs through.
On psm2, it crashes like this:

Message size 18446744073709551608 bigger than supported by PSM2 API. Max = 4294967296

And indeed, when tracing count - offset in the code above, the recv call is posted with count - offset = -2 on rank 5. I was surprised that this actually worked on vader and openib.

count - offset should never become negative. We will probably just have to skip this recv call in the algorithm's code whenever the count argument becomes non-positive, but I'll leave the actual fix to the algorithm's designer.

I also attach a short MPI_Bcast test program (as I was surprised to see this work on openib, I checked a few cases and outcomes). If started with

srun -N 7 --ntasks-per-node=1 ./bcast_test 10

the following code crashes on psm2 (as expected).

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

int my_MPI_Bcast( void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm ) {
  int i;
  int rank, size;
  /* Query rank and size on the communicator actually used for the transfer. */
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);

  if( rank == root ) {
    /* Linear broadcast: root sends to every other rank. */
    for(i=1; i<size; i++) {
      MPI_Send(buffer, count, datatype, i, 0, comm);
    }
  } else {
    MPI_Recv(buffer, count, datatype, root, 0, comm, MPI_STATUS_IGNORE);
  }

  return 0;
}

int main(int argc, char *argv[]) {
  int rank, size;
  int n, i, j;
  
  int *buf;  
  int root = 0, correct, allcorrect;
  int *res;
    
  MPI_Init(&argc, &argv);
  
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  if( argc < 2 ) {
    if( rank == root ) fprintf(stderr, "usage: %s <count>\n", argv[0]);
    MPI_Finalize();
    exit(1);
  }
  
  n = atoi(argv[1]);
  
  buf = (int*)calloc(n, sizeof(int));
  res = (int*)calloc(n, sizeof(int));
  
  if( rank == root ) {
    for(i=0; i<n; i++) {
      buf[i] = i;
    }
  }
  
  my_MPI_Bcast( buf, n, MPI_INT, root, MPI_COMM_WORLD );
  
  memcpy(res, buf, n*sizeof(int));

  if( rank != root ) {
    for(i=0; i<n; i++) buf[i] = 0;
  }
  
  MPI_Bcast( buf, n, MPI_INT, root, MPI_COMM_WORLD );
  
  MPI_Barrier(MPI_COMM_WORLD);
  for(j=1; j<size; j++ ) {
    if( rank == j ) {
      printf("%d: ", rank);
      for(i=0; i<n; i++) {
        printf("%d ", buf[i]);
      }    
      printf("\n");      
    }
    fflush(stdout);
    MPI_Barrier(MPI_COMM_WORLD);
  }
  
  
  correct = 1;
  for(i=0; i<n; i++) {
    if( res[i] != buf[i] ) {
      correct = 0;
      printf("%d: elem %d\n", rank, i);
    }
    fflush(stdout);
  }

  MPI_Allreduce(&correct, &allcorrect, 1, MPI_INT, MPI_MIN, MPI_COMM_WORLD);
  
  if( rank == root ) {
    if( allcorrect == 1 ) {
      printf("correct\n");
    } else {
      printf("incorrect\n");      
    }    
  }  
  
  free(buf);
  free(res);

  MPI_Finalize();
  return 0;
}
